Single Cell Nucleic Acid Detection and Analysis

ABSTRACT

Methods and compositions for digital profiling of nucleic acid sequences present in a sample are provided.

RELATED APPLICATION DATA

This application claims priority to U.S. Provisional Patent Application No. 61/467,037, filed on Mar. 24, 2011 and U.S. Provisional Patent Application No. 61/583,787, filed on Jan. 6, 2012, each of which is hereby incorporated by reference in its entirety for all purposes.

STATEMENT OF GOVERNMENT INTERESTS

This invention was made with government support under HG005097-01 and 1RC2HG005613-01 from the National Human Genome Research Institute. The Government has certain rights in the invention.

BACKGROUND

1. Field of the Invention

Embodiments of the present invention relate in general to methods and compositions for determining the expression profile of one or more (e.g., a plurality) nucleic acid sequences in a sample.

2. Description of Related Art

Although an organism's cells contain largely identical copies of genomic DNA, the RNA expression levels vary widely from cell to cell. RNA expression profiles can be measured. See Pohl and Shih (2004) Expert Rev. Mol. Diagn. 4:41 and Wang et al., Nature Reviews 10, 57-63 (2009), WO2007/076128, US2007/0161031 and WO2007/117620. Expression profiling has also been applied to single cells. See Tang et al., Nature Methods, 6, 377-382 (2009) and Tang et al., Cell Stem Cell, 6, 468-478 (2010). However, many nucleic acids exist at low copy numbers in a cell, making detection difficult, and a complete expression level profile involves a network containing thousands of genes. Therefore, a technique for genome-wide, digital quantification of nucleic acid molecules with a high dynamic range and single molecule sensitivity is needed to answer fundamental biological questions.

SUMMARY

Embodiments of the present disclosure are directed to a method of identifying target molecules in a sample, such as a plurality of target molecules from a single cell, using unique barcode sequences. According to one aspect, target molecules include nucleic acid molecules such as DNA or RNA, for example cDNA or mRNA. According to one aspect, a sample is provided including a plurality of nucleic acid molecules, such as DNA or RNA molecules, for example cDNA or mRNA. A nucleic acid molecule in the sample is tagged or labeled with its own unique barcode sequence, i.e. one unique barcode sequence for that particular nucleic acid molecule in the sample, or a combination of two or more barcode sequences providing a unique total barcode sequence for that particular nucleic acid molecule in the sample. The terms unique barcode sequence and unique total barcode sequence are used interchangeably herein. Additional nucleic acid molecules in the sample are also tagged or labeled with their own unique barcode sequences, i.e. one unique barcode sequence for that particular nucleic acid molecule in the sample, or a combination of two or more barcode sequences providing a unique total barcode sequence for that particular nucleic acid molecule in the sample. In this manner, tagged nucleic acid molecules in the sample have different barcode sequences, regardless of whether the individual nucleic acid molecules have the same or a different sequence. A plurality of nucleic acid molecules having the same sequence are referred to herein as having a copy number. According to one aspect, the percentage of individual nucleic acid molecules having their own unique barcode sequence is greater than 80%, greater than 90%, greater than 95%, greater than 96%, greater than 97%, greater than 98%, greater than 99%, greater than 99.1%, greater than 99.2%, greater than 99.3%, greater than 99.4%, greater than 99.5%, greater than 99.6%, greater than 99.7%, greater than 99.8%, or greater than 99.9%.
According to one aspect, each nucleic acid molecule, such as each molecule of RNA, such as mRNA, or DNA, such as cDNA, in the sample has its own unique barcode sequence, such that 100% of the nucleic acid molecules in the sample have their own unique barcode sequence. The tagged nucleic acid molecules with their own unique barcode sequences are then amplified in the case of DNA, such as cDNA, or reverse transcribed into corresponding cDNA in the case of RNA. According to one aspect, each cDNA includes a barcode sequence unique to the corresponding RNA from which it was transcribed. Each cDNA is then amplified to produce amplicons of the cDNA. Each amplicon includes the barcode sequence from the particular cDNA that was amplified to produce the amplicons. The amplicons, whether produced from DNA or RNA, are then sequenced and the barcodes are identified. The number of different or unique barcode sequences equates to the number of DNA or cDNA molecules that were amplified and, accordingly, in the case of cDNA, the number of RNA molecules uniquely tagged in the original sample.

According to one aspect, methods and compositions are provided for digital counting of nucleic acids such as RNA and/or DNA with a high dynamic range by using a nucleic acid tag or copy number barcode (CNB) in combination with DNA sequencing, such as massively parallel sequencing applied to genome-wide RNA expression profiling as described in Wang et al. (2009) Nat. Rev. 10:57, hereby incorporated by reference herein in its entirety for all purposes. The methods and compositions described herein reduce or eliminate the amplification bias that typically prevents high sensitivity counting of RNA and/or DNA molecules in a sample.
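The effect of amplification bias on read counts, and why counting unique barcodes is insensitive to it, can be illustrated with a small simulation. The following is a minimal sketch only: the function name, the per-cycle duplication model, the efficiency values, and the assumption that every amplicon is sequenced are all hypothetical choices made for illustration and are not taken from the disclosure.

```python
import random

def simulate_counts(copy_numbers, efficiencies, cycles=10, seed=0):
    """Return (read_counts, barcode_counts) per sequence after biased amplification.

    Hypothetical model: in each cycle every existing copy is duplicated with a
    sequence-specific probability (its amplification efficiency).
    """
    rng = random.Random(seed)
    reads = {}
    barcodes = {}
    next_barcode = 0
    for name, n_molecules in copy_numbers.items():
        seen = set()
        total_reads = 0
        for _ in range(n_molecules):
            barcode = next_barcode      # each original molecule gets its own label
            next_barcode += 1
            copies = 1
            for _ in range(cycles):
                copies += sum(
                    1 for _ in range(copies) if rng.random() < efficiencies[name]
                )
            total_reads += copies       # assumes every amplicon is read
            seen.add(barcode)
        reads[name] = total_reads
        barcodes[name] = len(seen)
    return reads, barcodes

# Two sequences with the same copy number but different efficiencies.
reads, barcodes = simulate_counts(
    {"geneA": 50, "geneB": 50}, {"geneA": 0.9, "geneB": 0.5}
)
```

In this toy model the read counts for the two sequences differ widely even though both start at 50 molecules, while the number of distinct barcodes per sequence stays at 50.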

In an alternate embodiment, a method of identifying the total number of nucleic acid molecules, such as DNA molecules in a sample, such as cDNA, or RNA molecules in a sample, such as a plurality of RNA molecules from a single cell, using unique barcode sequences is provided. According to one aspect, each molecule of DNA or RNA in a sample is tagged or labeled with its own unique barcode sequence, i.e. one unique barcode sequence for the molecule of DNA or RNA in the sample. In this manner, no tagged DNA or RNA in the sample has the same barcode sequence, regardless of whether DNA or RNA molecules are the same or different. This aspect of the present disclosure utilizes a sufficient number of unique candidate barcode sequences relative to the number of DNA or RNA molecules in a sample such that, with high probability, each DNA or RNA molecule will be tagged with a unique barcode sequence. One of skill in the art will recognize a finite probability of two DNA or RNA molecules being tagged with the same barcode sequence, but the number of candidate barcode sequences is selected such that the chance of this occurring is as small as possible. For RNA, each RNA with its own unique barcode sequence is then reverse transcribed into cDNA. Each cDNA includes a barcode sequence unique to the RNA from which it was transcribed. Each cDNA is then amplified to produce amplicons of the cDNA. Each amplicon includes the barcode sequence from the particular cDNA that was amplified to produce the amplicons. The amplicons are then sequenced and the barcodes are identified. The number of different or unique barcode sequences equates to the number of cDNA molecules that were amplified and, accordingly, the number of RNA molecules present in the original sample. This aspect of the present disclosure provides an elegant method for identifying the total number of RNA molecules in a sample.
This aspect also applies to identifying the total number of nucleic acids such as DNA in a sample such as a single cell where one of skill in the art will recognize that reverse transcription steps are not required when the nucleic acid is DNA. For example, cDNA is tagged with its own unique barcode sequence. Each cDNA is then amplified to produce amplicons of the cDNA. Each amplicon includes the barcode sequence from the particular cDNA that was amplified to produce the amplicons. The amplicons are then sequenced and the barcodes are identified. The number of different or unique barcode sequences equates to the number of cDNA molecules that were amplified.
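The relationship between the number of candidate barcode sequences and the chance that two molecules receive the same barcode can be estimated numerically. The sketch below assumes, for illustration only, that each molecule draws a barcode uniformly at random and independently from all 4^L sequences of length L; under that assumption a given molecule shares its barcode with no other molecule with probability (1 - 1/M)^(N-1), which is approximately exp(-(N-1)/M) for a pool of M barcodes and N molecules. The function names are hypothetical.

```python
import math

def fraction_unique(num_molecules, barcode_length, alphabet=4):
    """Approximate expected fraction of molecules carrying a barcode no other molecule has.

    Assumes uniform, independent random barcode assignment (an illustrative
    assumption) and uses the Poisson-style approximation exp(-(N - 1) / M).
    """
    pool_size = alphabet ** barcode_length
    return math.exp(-(num_molecules - 1) / pool_size)

def fraction_unique_exact(num_molecules, barcode_length, alphabet=4):
    """Same quantity without the exponential approximation."""
    pool_size = alphabet ** barcode_length
    return (1.0 - 1.0 / pool_size) ** (num_molecules - 1)

# Scan barcode lengths for one sample size to see where uniqueness saturates.
table = {length: fraction_unique(10**6, length) for length in (10, 15, 20)}
```

Under these assumptions the fraction rises toward 1 as the barcode length grows, so the barcode pool can be sized to the expected number of molecules in the sample.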

According to aspects of the present disclosure, a method is provided for identifying low copy number nucleic acids in a sample, such as a plurality of DNA and/or RNA molecules from a single cell. According to one aspect, each molecule of RNA in a sample is tagged or labeled with its own unique barcode sequence, i.e. one unique barcode sequence for the molecule of RNA in the sample. Because each molecule of RNA is tagged with its own unique barcode sequence regardless of copy number for a particular RNA sequence, reverse transcription and amplification will reveal RNA of a particular sequence having low copy number, as the sequencing and reading of barcodes is independent of the copy number of the original RNA sequence. This aspect of the present disclosure provides an elegant method for revealing the presence of low copy number RNA in a cell where the RNA was previously unknown or undiscovered. This aspect also applies to identifying the total number of nucleic acids such as DNA in a sample such as a single cell where one of skill in the art will recognize that reverse transcription steps are not required when the nucleic acid is DNA. For example, cDNA in low copy number in a sample is tagged with its own unique barcode sequence. Each cDNA is then amplified to produce amplicons of the cDNA. Each amplicon includes the barcode sequence from the particular cDNA that was amplified to produce the amplicons. The amplicons are then sequenced and the barcodes are identified. The number of different or unique barcode sequences equates to the number of cDNA molecules that were amplified.

According to certain aspects of the disclosure, a linear pre-amplification method is provided where RNA tagged with its own unique barcode is repeatedly reverse transcribed into multiple copies of cDNA with the unique barcode. The multiple copies of cDNA with the unique barcode may then be amplified to produce the amplicons. The amplicons are then sequenced and the barcodes are identified. As with the other described methods, the number of different or unique barcode sequences ultimately equates to the number of RNA molecules present in the original sample. This aspect of providing multiple copies of cDNA with the unique barcode provides an elegant method for increasing the population of cDNA molecules corresponding to a particular RNA molecule from one to between about 2 and about 1000 copies or between about 5 and about 100 copies, which improves the efficiency of amplification and detection of the unique barcode sequence, thereby resulting in an accurate count of the number of RNA molecules in a sample. According to certain aspects, linear pre-amplification can be used in a method where DNA is tagged with its own unique barcode. According to this aspect, repeated replication by a DNA polymerase is used for increasing the copy number of a barcoded DNA molecule to between about 2 and about 1000 copies or between about 5 and about 100 copies.
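One way to see why several barcoded cDNA copies per RNA molecule can improve detection is a simple loss model. The sketch below is a hypothetical illustration only: it assumes each cDNA copy is independently carried through amplification and sequencing with a fixed probability, so a barcode is detected if at least one of its copies survives. The recovery probability and the function name are assumptions, not values from the disclosure.

```python
def detection_probability(per_copy_recovery, copies):
    """Probability that at least one of `copies` identical barcoded cDNAs is recovered.

    Assumes independent recovery of each copy with probability
    `per_copy_recovery` (an illustrative assumption).
    """
    if not 0.0 <= per_copy_recovery <= 1.0:
        raise ValueError("per_copy_recovery must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - per_copy_recovery) ** copies

# Compare a single cDNA copy with linearly pre-amplified pools.
single = detection_probability(0.2, 1)
ten_copies = detection_probability(0.2, 10)
hundred_copies = detection_probability(0.2, 100)
```

Because every copy made from one RNA molecule carries the same barcode, the extra copies raise the chance that the molecule is seen without adding to the count of unique barcodes.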

According to certain aspects of the disclosure, a method of making primers personalized to the RNA of a particular cell is provided. According to this aspect, genomic DNA obtained from a cell is fragmented into lengths of between about 5 bases to about 50 bases, between about 10 bases to about 30 bases, or between about 15 bases to about 20 bases. The fragmented genomic DNA is then used as primers for the reverse transcription of RNA from the same species of cell. According to this aspect of the disclosure, the use of a genomic primer pool to reverse transcribe RNA from the same species of cell increases the efficiency of reverse transcription of RNA to cDNA.

According to an aspect of the present disclosure, a method of determining copy number of a nucleic acid molecule in a sample including a plurality of nucleic acid molecules is provided. According to an aspect, the method includes attaching a unique barcode sequence or a unique barcode sequence-primer conjugate or a unique barcode sequence-adapter conjugate to substantially each of the plurality of nucleic acid molecules in the sample to produce a plurality of barcoded nucleic acid molecules, amplifying the plurality of barcoded nucleic acid molecules in the sample to produce amplicons of the plurality of barcoded nucleic acid molecules, sequencing each amplicon to identify an associated nucleic acid sequence and an associated barcode sequence, selecting a first target nucleic acid sequence and determining the number of unique associated barcode sequences for the first target nucleic acid sequence, wherein the number of unique associated barcode sequences is the copy number of the first target nucleic acid sequence. According to an aspect, the nucleic acid molecules are DNA or RNA. According to an aspect, the plurality of barcoded nucleic acid molecules is a plurality of barcoded RNA molecules and further including the steps of reverse transcribing the plurality of barcoded RNA molecules to produce barcoded cDNA molecules and amplifying the barcoded cDNA molecules to produce amplicons of the barcoded cDNA molecules. According to an aspect, a step is provided of repeatedly reverse transcribing the plurality of barcoded RNA molecules to produce linear pre-amplified barcoded cDNA molecules and amplifying the linear pre-amplified barcoded cDNA molecules to produce amplicons of the linear pre-amplified barcoded cDNA molecules. According to an aspect, the step of repeatedly reverse transcribing the plurality of barcoded RNA molecules includes using reverse transcriptase and a nicking enzyme.
According to an aspect, the plurality of barcoded nucleic acid molecules is a plurality of barcoded DNA molecules and further including the steps of repeated replication of the plurality of barcoded DNA molecules to produce a plurality of pre-amplified barcoded DNA molecules and amplifying the plurality of pre-amplified barcoded DNA molecules to produce amplicons of the plurality of pre-amplified barcoded DNA molecules. According to an aspect, the step of repeated replication of the plurality of barcoded DNA molecules includes using DNA polymerase and a nicking enzyme. According to an aspect, the sample is obtained from one or more cells of a first cell type and wherein the primer of the unique barcode sequence-primer conjugate is generated from genomic DNA of the first cell type.
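The final step of the method, determining the number of unique associated barcode sequences for a target sequence, can be sketched as a grouping operation over sequenced amplicons. This is a minimal illustration assuming each read has already been resolved to a (sequence identifier, barcode) pair; the function name, the identifiers, and the toy reads are hypothetical.

```python
from collections import defaultdict

def copy_numbers(reads):
    """Map each sequence identifier to its number of distinct barcodes.

    `reads` is an iterable of (sequence_id, barcode) pairs, one per
    sequenced amplicon. Repeated pairs are amplification duplicates and
    are counted once.
    """
    barcodes_by_sequence = defaultdict(set)
    for sequence_id, barcode in reads:
        barcodes_by_sequence[sequence_id].add(barcode)
    return {seq: len(found) for seq, found in barcodes_by_sequence.items()}

# Toy data: geneA was present as two molecules, geneB as one.
reads = [
    ("geneA", "ACGT"), ("geneA", "ACGT"), ("geneA", "TTGA"),
    ("geneB", "GGCA"), ("geneB", "GGCA"), ("geneB", "GGCA"),
]
result = copy_numbers(reads)
```

Here geneB contributes as many reads as geneA, yet its copy number is 1 because all three of its reads share one barcode.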

According to an aspect of the present disclosure, a method of counting nucleic acid molecules in a sample including a plurality of nucleic acid molecules is provided. According to an aspect, the method includes attaching a unique barcode sequence or a unique barcode sequence-primer conjugate or a unique barcode sequence-adapter conjugate to substantially each of the plurality of nucleic acid molecules in the sample to produce a plurality of barcoded nucleic acid molecules, amplifying the plurality of barcoded nucleic acid molecules in the sample to produce amplicons of the plurality of barcoded nucleic acid molecules, sequencing each amplicon to identify an associated barcode sequence, and counting the number of unique associated barcode sequences as a measure of the number of nucleic acid molecules in the sample. According to an aspect, the nucleic acid molecules are DNA or RNA. According to an aspect, the plurality of barcoded nucleic acid molecules is a plurality of barcoded RNA molecules and further including the steps of reverse transcribing the plurality of barcoded RNA molecules to produce barcoded cDNA molecules and amplifying the plurality of barcoded cDNA molecules to produce amplicons of the plurality of barcoded cDNA molecules. According to an aspect, a step is provided of repeatedly reverse transcribing the plurality of barcoded RNA molecules to produce linear pre-amplified barcoded cDNA molecules and amplifying the linear pre-amplified barcoded cDNA molecules to produce amplicons of the linear pre-amplified barcoded cDNA molecules.
According to an aspect, the step of repeatedly reverse transcribing the plurality of barcoded RNA molecules includes using reverse transcriptase and a nicking enzyme. According to an aspect, the plurality of barcoded nucleic acid molecules is a plurality of barcoded DNA molecules and further including the steps of repeated replication of the plurality of barcoded DNA molecules to produce a plurality of pre-amplified barcoded DNA molecules and amplifying the plurality of pre-amplified barcoded DNA molecules to produce amplicons of the plurality of pre-amplified barcoded DNA molecules. According to an aspect, the step of repeated replication of the plurality of barcoded DNA molecules includes using DNA polymerase and a nicking enzyme. According to an aspect, the sample is obtained from one or more cells of a first cell type and wherein the primer of the unique barcode sequence-primer conjugate is generated from genomic DNA of the first cell type.

According to an aspect of the present disclosure, a method of determining copy numbers of nucleic acid molecules in a sample is provided. According to an aspect, the method includes attaching a unique barcode sequence or a unique barcode sequence-primer conjugate or a unique barcode sequence-adapter conjugate to substantially each of the nucleic acid molecules in the sample to produce a plurality of barcoded nucleic acid molecules, amplifying the plurality of barcoded nucleic acid molecules in the sample to produce amplicons of the plurality of barcoded nucleic acid molecules, massively parallel sequencing the amplicons of the plurality of barcoded nucleic acid molecules to identify for each amplicon an associated nucleic acid sequence and an associated barcode sequence, and determining the number of unique associated barcode sequences for each nucleic acid sequence in the sample. According to an aspect, the nucleic acid molecules are DNA or RNA. According to an aspect, the plurality of barcoded nucleic acid molecules is a plurality of barcoded RNA molecules and further including the steps of reverse transcribing the plurality of barcoded RNA molecules to produce barcoded cDNA molecules and amplifying the plurality of barcoded cDNA molecules to produce amplicons of the plurality of barcoded cDNA molecules. According to an aspect, a step is provided of repeatedly reverse transcribing the plurality of barcoded RNA molecules to produce linear pre-amplified barcoded cDNA molecules and amplifying the linear pre-amplified barcoded cDNA molecules to produce amplicons of the linear pre-amplified barcoded cDNA molecules. According to an aspect, the step of repeatedly reverse transcribing the plurality of barcoded RNA molecules includes using reverse transcriptase and a nicking enzyme.
According to an aspect, the plurality of barcoded nucleic acid molecules is a plurality of barcoded DNA molecules and further including the steps of repeated replication of the plurality of barcoded DNA molecules to produce a plurality of pre-amplified barcoded DNA molecules and amplifying the plurality of pre-amplified barcoded DNA molecules to produce amplicons of the plurality of pre-amplified barcoded DNA molecules. According to an aspect, the step of repeated replication of the plurality of barcoded DNA molecules includes using DNA polymerase and a nicking enzyme. According to an aspect, the sample is obtained from one or more cells of a first cell type and wherein the primer of the unique barcode sequence-primer conjugate is generated from genomic DNA of the first cell type.

According to one aspect, one or more barcodes may be attached to a nucleic acid such as DNA or RNA. The one or more barcodes may be attached at any location within the nucleic acid. The one or more barcodes may be attached to either end of the nucleic acid. The one or more barcodes may be attached to each end of the nucleic acid. The one or more barcodes may be attached in tandem or in series to the nucleic acid at one or both ends of the nucleic acid or within the nucleic acid.

According to one aspect, the methods described herein may utilize barcodes that are random sequences or systematically designed sequences as are known in the art. Additionally, the methods described herein may utilize barcodes resulting from an optimization protocol, system or method. In such an optimization protocol, system or method, barcodes are designed or selected such that they are not within a certain distance of one another and are sufficient to maintain uniqueness should a barcode sequence be altered during the amplification and sequencing methods or other enzymatic reactions described herein and known to those of skill in the art. For a given barcode length and number of barcodes in a set, the “distance” refers to the number of nucleotides a barcode must change before becoming identical to another barcode in the set. For example, for a two member barcode set of AAA and AAT, the distance would be 1 because AAT need change only 1 nucleotide, i.e. the T to an A, to become identical with AAA. Likewise AAA need change only 1 nucleotide, i.e. the A to a T, to become identical with AAT. For example, if the selected distance were 9, then the members of the optimized set would have a distance greater than 9, as a barcode with a distance of 9, if 9 nucleotides were changed, would result in creation of a barcode identical to another member of the set. An acceptable distance between barcodes to maintain uniqueness for a given number of alterations is determined by one or more of the particular application for the barcodes, the length of the barcode, the number of barcodes, the amplification error rate, the copy error rate from enzymatic reactions described herein, the sequencing error rate and the number of nucleic acids to be uniquely tagged.
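The distance described above corresponds, for barcodes of equal length and substitution-only changes, to the Hamming distance. The following minimal sketch computes it for a pair of barcodes and for a whole set; restricting the comparison to substitutions (ignoring insertions and deletions) is an assumption made for illustration, and the function names are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two members of a barcode set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

# The two-member example from the text: AAA and AAT differ at one position.
example = min_pairwise_distance(["AAA", "AAT"])
```

A set meets a selected distance when `min_pairwise_distance` exceeds that value, since no member can then become identical to another with that many substitutions.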

According to an aspect of the present disclosure, barcode sequences useful in the methods described herein are optimized to produce an optimized set of barcodes. The optimized set of barcodes minimizes sequence-dependent bias and/or amplification noise during digital RNA sequencing. The optimized set of barcodes is characterized by a distance of its members such that any particular member may be altered up to and including the selected distance and still maintain uniqueness within the optimized set. If uniqueness were not maintained within the set, then alterations may produce identical barcodes, which may lead to false data regarding the number of nucleic acid molecules in a particular sample. A set of barcodes with members having a predetermined distance allows counting of nucleic acids within a sample with single-copy resolution despite bias from library preparation, sequence-dependent bias and amplification noise. According to one aspect, alterations which may happen during library preparation, amplification and sequencing do not reduce uniqueness of the set of barcodes. According to one aspect, the optimized set of barcodes reduces false original barcode reads that may otherwise result from sequencing errors. According to one aspect, use of the optimized set of barcodes described herein lowers amplification bias ordinarily resulting from processes such as PCR amplification. According to one aspect, one or more optimized barcodes can be ligated to DNA or RNA and amplified with minimal bias and distinguished from one another despite the accumulation of PCR mutations and sequencing errors.
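How a barcode set with a known minimum distance tolerates PCR mutations and sequencing errors can be sketched as a matching step that assigns each observed barcode to a member of the set. The sketch below is an illustrative assumption rather than the procedure of the disclosure: it considers substitutions only, and it accepts a read when exactly one set member lies within a chosen number of mismatches. For a set with minimum Hamming distance d, at most one member can lie within floor((d - 1) / 2) mismatches of any read.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcode(observed, barcode_set, max_errors):
    """Return the unique set member within `max_errors` mismatches, else None.

    None is returned when no member is close enough or when the match is
    ambiguous, so that such reads can be discarded instead of miscounted.
    """
    matches = [
        b for b in barcode_set
        if len(b) == len(observed) and hamming(observed, b) <= max_errors
    ]
    return matches[0] if len(matches) == 1 else None

# Toy set with minimum distance 5, so up to 2 substitutions are unambiguous.
toy_set = ["AAAAA", "TTTTT"]
corrected = assign_barcode("AAATA", toy_set, max_errors=2)
```

Reads that cannot be assigned unambiguously are dropped in this sketch, which avoids counting an altered barcode as a new molecule.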

According to one aspect, an optimized barcode set may be characterized as one maintaining sequence uniqueness of its members to the extent of the predetermined distance for the members. According to an aspect, an optimized set of barcodes is provided whereby the optimized barcodes lack significant sequence overlap with other barcodes, lack significant complementarity with other barcodes, lack significant overlap and complementarity with adapter and primer sequences used in library preparation and sequencing, lack significant sequence overlap and complementarity with sequences of the nucleic acids of interest, such as the transcriptome of a cell, lack significantly long homopolymers, lack high GC-content, lack low GC-content, lack possible secondary structures, or lack repetitive sequences.

According to an aspect, a barcode described herein may be attached to two different locations on a target nucleic acid, such as at each end of a target nucleic acid. The two barcode sequences independently attached to the nucleic acid, along with the target molecule sequence, are then determined using methods known to those of skill in the art, such as single read sequencing or paired-end sequencing.

According to an aspect, a set of barcodes is used as building blocks to create a barcode attached to a particular nucleic acid. For example, one or more barcodes from the set may be attached to a nucleic acid giving the nucleic acid a unique barcode. For example, two or more barcodes from the set may be attached to a nucleic acid giving the nucleic acid a unique barcode. In this manner, any number of barcodes within the set can be added to a nucleic acid to provide a unique barcode. For example, a first barcode from the set can be added to each nucleic acid in a sample. Then a second barcode from the set can be added to each nucleic acid in a sample. In this manner, the two barcodes combined provide a unique barcode sequence for the nucleic acid. In addition, a third barcode from the set can be added to each nucleic acid in a sample. In addition, a fourth barcode from the set can be added to each nucleic acid in a sample, and so on up to the number of barcodes in the set. In this manner, the barcodes within the set are used as building blocks to create a unique barcode sequence of desired length for nucleic acids within a sample. In this manner, fewer barcode sequences may be included in a set of unique barcodes to create unique barcode sequences for nucleic acids in a sample. For example, a set of barcodes, such as an optimized set of barcodes described herein, including 145 barcodes with each having a length of 20 nucleotides will produce 145×145=21,025 unique barcodes if two barcodes from the set are independently attached to a nucleic acid to create a unique barcode. According to certain aspects, attaching two barcodes to a nucleic acid, such as attaching a barcode at each end of a nucleic acid, may increase the overall randomness of barcode sampling because in certain embodiments the two ends of the nucleic acid may be unlikely to have a similar degree of bias.
Although the use of two barcodes is exemplified, it is to be understood that any number of barcodes from the set may be independently attached to a nucleic acid sequence to create a unique total barcode sequence for the nucleic acid. In this manner, barcodes from the set can be used as individual building blocks to create a unique barcode for each nucleic acid in a sample.
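The building-block arithmetic can be checked directly: with a set of S barcodes and k independently attached positions, the number of distinct total barcodes is S^k. The sketch below reproduces the 145×145 figure and enumerates the combinations for a small toy set; the toy barcodes and the function names are hypothetical.

```python
from itertools import product

def combined_barcode_count(set_size, barcodes_per_molecule):
    """Number of ordered combinations when barcodes are attached independently."""
    return set_size ** barcodes_per_molecule

def combined_barcodes(barcode_set, barcodes_per_molecule):
    """Enumerate every total barcode as a concatenation of set members."""
    return [
        "".join(parts)
        for parts in product(barcode_set, repeat=barcodes_per_molecule)
    ]

# The example from the text: 145 barcodes, two attached per nucleic acid.
pairs_from_145 = combined_barcode_count(145, 2)

# A three-member toy set yields 3 x 3 = 9 distinct total barcodes.
toy_pairs = combined_barcodes(["ACG", "TGA", "CTT"], 2)
```

Adding a third or fourth position multiplies the pool again by the set size, which is why a small optimized set can label a large number of molecules.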

According to one aspect, a method is provided for designing an optimizedbarcode set by identifying sequences that lack significant sequenceoverlap with other barcodes, lack significant complementarity with otherbarcodes, lack significant overlap and complementarity with adapter andprimer sequences used in library preparation and sequencing, lacksignificant sequence overlap and complementarity with sequences of thenucleic acids of interest, such as the transcriptome of a cell, lacksignificantly long homopolymers, lack high GC-content, lack lowGC-content, lack possible secondary structures, or lack repetitivesequences. According to one aspect, a computer is used with commerciallyavailable software to design the optimized barcode set which can becreated using random or systematic methods known to those of skill inthe art by which barcodes meet criteria described herein.

Further features and advantages of certain embodiments of the present disclosure will become more fully apparent in the following description of the embodiments and drawings thereof, and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other features and advantages of the present invention will be more fully understood from the following detailed description of illustrative embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 schematically depicts RNA sequencing (RNAseq) using copy number barcode.

FIG. 2 schematically depicts hairpin primer configurations according to certain embodiments.

FIG. 3 schematically depicts linear amplification of cDNA using a nicking enzyme with reverse transcriptase. (1) A DNA primer which includes a double stranded region, copy number barcode (CNB), and a nicking enzyme recognition site is attached to the 3′ end of RNA. (2) A reverse transcriptase generates cDNA which includes the complementary sequence of CNB. (3) A nicking enzyme makes a nick. (4) The reverse transcriptase, which has strand displacement activity, starts to generate the cDNA again from the nicking site. (5) The first generated cDNA completely detaches from the template. (6) Multiple cDNAs which have the same CNB (complementary sequence) are generated from the same RNA, meaning that the RNA is linearly amplified such that digital counting is uncompromised.

FIGS. 4A-4C depict construction of a genomic primer pool (GPP). (A) One method to make GPP according to certain embodiments. An adapter which has a restriction enzyme recognition site is attached to both ends of fragmented genomic DNA. The enzyme cuts, for example, 20 bp away from the recognition site, which is in the genomic sequence. (B) Fragmentation of genomic DNA as a function of time. Samples were loaded on an agarose gel. Length of fragmented DNA became shorter over time. (C) After incubation with a restriction enzyme of PpuEI (cut at 16 bp away), BpmI (16 bp), or MmeI (20 bp) respectively, the samples were loaded on an acrylamide gel. Separated DNA showed expected lengths.

FIG. 5 illustrates that the CNB allows digital DNA counting in a single tube. Sequencing of amplified DNA without CNB (top) and with CNB (bottom) are shown. Top panel sequences are set forth as (SEQ ID NO:16). Bottom panel sequences, starting at top and going downward, are set forth as (SEQ ID NO:17), (SEQ ID NO:18), (SEQ ID NO:19), (SEQ ID NO:20), (SEQ ID NO:21), (SEQ ID NO:22), (SEQ ID NO:23), (SEQ ID NO:24), respectively.

FIG. 6 schematically depicts the principle of digital counting with CNB.

FIG. 7 illustrates RNA counting in a single tube by using the CNB primer. Sequencing of amplified cDNA without CNB (top) and with CNB (bottom) are shown. Top panel sequences are set forth as (SEQ ID NO:30). Bottom panel sequences, starting at top and going downward, are set forth as (SEQ ID NO:31), (SEQ ID NO:32), (SEQ ID NO:33), (SEQ ID NO:34), (SEQ ID NO:35), (SEQ ID NO:36), (SEQ ID NO:37), (SEQ ID NO:38), respectively.

FIG. 8 is a graph depicting the calculated fraction of molecules that are uniquely barcoded as a function of barcode length for four different sample sizes (N=10⁵, 10⁶, 10⁷, 10⁸) based on the Poisson distribution.

FIG. 9 is a graph depicting the results of qPCR on three DNA templates using a common set of primers. The three templates have different amounts of secondary structure, and are therefore amplified during qPCR with different efficiencies. The graph shows both the relative amount of amplification product generated after 20 cycles of qPCR and the corresponding amplification efficiencies for the three templates.

FIG. 10A is an illustration of the digital counting method described herein for cDNA. FIG. 10B is an illustration of one aspect of the present disclosure where a cDNA is tagged with one barcode at each end (i.e., two total barcodes) and the sequence is determined using paired-end reading.

FIG. 11A-C are graphs quantifying spike-in sequence.

FIG. 12A-C are histograms of unique barcodes from parallel simulation of the theoretical library using optimized barcodes and random barcodes.

FIG. 13A-B are graphs of down-sampling of all spike-in data by a factor of 10 and the resulting digital counts obtained.

FIG. 14 is a histogram of the number of bases between the center positions of all pairs of molecules mapped to the same transcription unit that contain the same barcodes for pairs of molecules both mapped to the sense or antisense strand of the E. coli genome (light) and also for pairs of molecules mapping to different strands of the E. coli genome (dark).

FIG. 15 is a histogram of the difference in fragment length for allpairs of molecules mapped to the same transcription unit that containthe same barcode for pairs of molecules both mapped to the sense orantisense strand of E. coli genome (light) and also for pairs ofmolecules mapping to different strands of the E. coli genome (dark).

FIG. 16A-D are graphs showing digital quantification of the E. colitranscriptome.

FIG. 17A-D are comparisons of uniquely mapped reads per kilobase of eachtranscription unit or gene per million total uniquely mapped reads(RPKM) and uniquely mapped digital counts per kilobase of eachtranscription unit or gene per million total uniquely mapped molecules(DPKM) for all detected transcription units.

FIG. 18A-C are graphs depicting reproducibility of digital andconventional quantification of the E. coli transcriptome.

FIG. 19A-C are graphs depicting simulated RNA expression quantification.

FIG. 20 is a depiction of a barcoding method for digital counting utilizing stochastic labeling (path B) as compared to a conventional dilution method for digital counting (path A).
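The RPKM and DPKM quantities named in the description of FIG. 17 share one normalization, which can be sketched as follows. This is a minimal illustration only: the function name and the example numbers are hypothetical and are not taken from the disclosure, and the sketch assumes that reads (for RPKM) or distinct barcoded molecules (for DPKM) have already been uniquely mapped and tallied.

```python
def per_kilobase_per_million(count, length_bp, total_mapped):
    """Normalize a raw count by feature length (kb) and library size (millions).

    count        -- uniquely mapped reads (RPKM) or digital counts (DPKM)
                    assigned to one transcription unit or gene
    length_bp    -- length of that transcription unit or gene in base pairs
    total_mapped -- total uniquely mapped reads (RPKM) or molecules (DPKM)
    """
    if length_bp <= 0 or total_mapped <= 0:
        raise ValueError("length_bp and total_mapped must be positive")
    kilobases = length_bp / 1_000
    millions = total_mapped / 1_000_000
    return count / kilobases / millions


# Hypothetical example: 500 counts on a 2,000 bp unit in a library of 10 million.
example_value = per_kilobase_per_million(500, 2_000, 10_000_000)
```

Passing read totals yields an RPKM-style value, and passing digital counts with the total number of uniquely mapped molecules yields a DPKM-style value.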

DETAILED DESCRIPTION

The present invention is based in part on the discovery of methods and compositions for expression profiling of nucleic acid sequences, such as RNA and DNA. In certain exemplary embodiments, methods of individually counting RNA molecules or DNA molecules in a sample are provided. In certain aspects, these methods include the step of attaching a distinguishable or unique barcode, referred to herein as a copy number barcode or CNB, to a respective nucleic acid molecule, such as a DNA molecule in a sample, such as cDNA, or such as an RNA molecule in a sample, such as RNA in a cell. According to one aspect, DNA or RNA molecules in a sample receive their own unique barcode. According to one aspect, RNA molecules in the sample with the unique barcodes are then reverse transcribed into cDNA. The cDNA is then amplified and the sequences of the amplified cDNA with the unique barcodes are then determined using methods known to those skilled in the art, such as next generation sequencing methods that simultaneously determine the sequences of the cDNA molecules in a single sample. DNA, such as cDNA in a sample, can similarly be counted using unique barcode sequences.
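The counting principle described above can be sketched in a few lines: because amplification copies a molecule together with its barcode, the number of distinct barcodes observed for a target approximates the number of original molecules, regardless of how many reads each one produced. The sketch below is illustrative only. It assumes that each read has already been assigned a target identifier and that its barcode has been extracted, and it ignores sequencing errors in the barcode; the function name and the example data are hypothetical.

```python
from collections import defaultdict


def digital_counts(reads):
    """Return the number of distinct barcodes seen for each target.

    reads -- iterable of (target_id, barcode) pairs, one pair per sequencing read
    """
    barcodes_per_target = defaultdict(set)
    for target_id, barcode in reads:
        # Reads that repeat a (target, barcode) pair are amplification copies
        # of one original molecule, so the set keeps a single entry for them.
        barcodes_per_target[target_id].add(barcode)
    return {target: len(codes) for target, codes in barcodes_per_target.items()}


# Hypothetical reads: geneA has two original molecules, one of them read three times.
example_reads = [
    ("geneA", "ACGTAC"), ("geneA", "ACGTAC"), ("geneA", "ACGTAC"),
    ("geneA", "TTGACA"),
    ("geneB", "GGCATC"), ("geneB", "GGCATC"),
]
example_counts = digital_counts(example_reads)
```

In this example six reads collapse to two molecules of geneA and one of geneB, which is the sense in which the count is independent of amplification bias.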

Terms and symbols of nucleic acid chemistry, biochemistry, genetics, and molecular biology used herein follow those of standard treatises and texts in the field, e.g., Kornberg and Baker, DNA Replication, Second Edition (W.H. Freeman, New York, 1992); Lehninger, Biochemistry, Second Edition (Worth Publishers, New York, 1975); Strachan and Read, Human Molecular Genetics, Second Edition (Wiley-Liss, New York, 1999); Eckstein, editor, Oligonucleotides and Analogs: A Practical Approach (Oxford University Press, New York, 1991); Gait, editor, Oligonucleotide Synthesis: A Practical Approach (IRL Press, Oxford, 1984); and the like.

As used herein, the term “nucleoside” refers to a molecule having a purine or pyrimidine base covalently linked to a ribose or deoxyribose sugar. Exemplary nucleosides include adenosine, guanosine, cytidine, uridine and thymidine. Additional exemplary nucleosides include inosine, 1-methyl inosine, pseudouridine, 5,6-dihydrouridine, ribothymidine, N²-methylguanosine and N²,N²-dimethylguanosine (also referred to as “rare” nucleosides). The term “nucleotide” refers to a nucleoside having one or more phosphate groups joined in ester linkages to the sugar moiety. Exemplary nucleotides include nucleoside monophosphates, diphosphates and triphosphates. The terms “polynucleotide,” “oligonucleotide” and “nucleic acid molecule” are used interchangeably herein and refer to a polymer of nucleotides, either deoxyribonucleotides or ribonucleotides, of any length joined together by a phosphodiester linkage between 5′ and 3′ carbon atoms. Polynucleotides can have any three-dimensional structure and can perform any function, known or unknown. The following are non-limiting examples of polynucleotides: a gene or gene fragment (for example, a probe, primer, EST or SAGE tag), exons, introns, messenger RNA (mRNA), transfer RNA, ribosomal RNA, ribozymes, cDNA, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes and primers. A polynucleotide can comprise modified nucleotides, such as methylated nucleotides and nucleotide analogs. The term also refers to both double- and single-stranded molecules. Unless otherwise specified or required, any embodiment of this invention that comprises a polynucleotide encompasses both the double-stranded form and each of two complementary single-stranded forms known or predicted to make up the double-stranded form. A polynucleotide is composed of a specific sequence of four nucleotide bases: adenine (A); cytosine (C); guanine (G); thymine (T); and uracil (U) for thymine when the polynucleotide is RNA. Thus, the term polynucleotide sequence is the alphabetical representation of a polynucleotide molecule. This alphabetical representation can be input into databases in a computer having a central processing unit and used for bioinformatics applications such as functional genomics and homology searching.

The terms “RNA,” “RNA molecule” and “ribonucleic acid molecule” refer to a polymer of ribonucleotides. The terms “DNA,” “DNA molecule” and “deoxyribonucleic acid molecule” refer to a polymer of deoxyribonucleotides. DNA and RNA can be synthesized naturally (e.g., by DNA replication or transcription of DNA, respectively). RNA can be post-transcriptionally modified. DNA and RNA can also be chemically synthesized. DNA and RNA can be single-stranded (i.e., ssDNA and ssRNA, respectively) or multi-stranded (e.g., double stranded, i.e., dsDNA and dsRNA, respectively). “mRNA” or “messenger RNA” is single-stranded RNA that specifies the amino acid sequence of one or more polypeptide chains. This information is translated during protein synthesis when ribosomes bind to the mRNA.

As used herein, the term “small interfering RNA” (“siRNA”) (also referred to in the art as “short interfering RNAs”) refers to an RNA (or RNA analog) comprising between about 10-50 nucleotides (or nucleotide analogs) which is capable of directing or mediating RNA interference. In certain exemplary embodiments, an siRNA comprises between about 15-30 nucleotides or nucleotide analogs, between about 16-25 nucleotides (or nucleotide analogs), between about 18-23 nucleotides (or nucleotide analogs), and even between about 19-22 nucleotides (or nucleotide analogs) (e.g., 19, 20, 21 or 22 nucleotides or nucleotide analogs). The term “short” siRNA refers to an siRNA comprising about 21 nucleotides (or nucleotide analogs), for example, 19, 20, 21 or 22 nucleotides. The term “long” siRNA refers to an siRNA comprising about 24-25 nucleotides, for example, 23, 24, 25 or 26 nucleotides. Short siRNAs may, in some instances, include fewer than 19 nucleotides, e.g., 16, 17 or 18 nucleotides, provided that the shorter siRNA retains the ability to mediate RNAi. Likewise, long siRNAs may, in some instances, include more than 26 nucleotides, provided that the longer siRNA retains the ability to mediate RNAi absent further processing, e.g., enzymatic processing, to a short siRNA.

The terms “nucleotide analog,” “altered nucleotide” and “modified nucleotide” refer to a non-standard nucleotide, including non-naturally occurring ribonucleotides or deoxyribonucleotides. In certain exemplary embodiments, nucleotide analogs are modified at any position so as to alter certain chemical properties of the nucleotide yet retain the ability of the nucleotide analog to perform its intended function. Examples of positions of the nucleotide which may be derivatized include the 5 position, e.g., 5-(2-amino)propyl uridine, 5-bromo uridine, 5-propyne uridine, 5-propenyl uridine, etc.; the 6 position, e.g., 6-(2-amino)propyl uridine; the 8-position for adenosine and/or guanosines, e.g., 8-bromo guanosine, 8-chloro guanosine, 8-fluoroguanosine, etc. Nucleotide analogs also include deaza nucleotides, e.g., 7-deaza-adenosine; O- and N-modified (e.g., alkylated, e.g., N6-methyl adenosine, or as otherwise known in the art) nucleotides; and other heterocyclically modified nucleotide analogs such as those described in Herdewijn, Antisense Nucleic Acid Drug Dev., 2000 Aug. 10(4):297-310.

Nucleotide analogs may also comprise modifications to the sugar portion of the nucleotides. For example, the 2′ OH-group may be replaced by a group selected from H, OR, R, F, Cl, Br, I, SH, SR, NH₂, NHR, NR₂, COOR, or OR, wherein R is substituted or unsubstituted C₁-C₆ alkyl, alkenyl, alkynyl, aryl, etc. Other possible modifications include those described in U.S. Pat. Nos. 5,858,988 and 6,291,438.

The phosphate group of the nucleotide may also be modified, e.g., by substituting one or more of the oxygens of the phosphate group with sulfur (e.g., phosphorothioates), or by making other substitutions which allow the nucleotide to perform its intended function such as described in, for example, Eckstein, Antisense Nucleic Acid Drug Dev. 2000 Apr. 10(2):117-21, Rusckowski et al. Antisense Nucleic Acid Drug Dev. 2000 Oct. 10(5):333-45, Stein, Antisense Nucleic Acid Drug Dev. 2001 Oct. 11(5):317-25, Vorobjev et al. Antisense Nucleic Acid Drug Dev. 2001 Apr. 11(2):77-85, and U.S. Pat. No. 5,684,143. Certain of the above-referenced modifications (e.g., phosphate group modifications) decrease the rate of hydrolysis of, for example, polynucleotides comprising said analogs in vivo or in vitro.

As used herein, the term “isolated RNA” (e.g., “isolated mRNA”) refers to RNA molecules which are substantially free of other cellular material, or culture medium when produced by recombinant techniques, or substantially free of chemical precursors or other chemicals when chemically synthesized.

The term “in vitro” has its art recognized meaning, e.g., involving purified reagents or extracts, e.g., cell extracts. The term “in vivo” also has its art recognized meaning, e.g., involving living cells, e.g., immortalized cells, primary cells, cell lines, and/or cells in an organism.

As used herein, the terms “complementary” and “complementarity” are used in reference to nucleotide sequences related by the base-pairing rules. For example, the sequence 5′-AGT-3′ is complementary to the sequence 5′-ACT-3′. Complementarity can be partial or total. Partial complementarity occurs when one or more nucleic acid bases is not matched according to the base pairing rules. Total or complete complementarity between nucleic acids occurs when each and every nucleic acid base is matched with another base under the base pairing rules. The degree of complementarity between nucleic acid strands has significant effects on the efficiency and strength of hybridization between nucleic acid strands.
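The base-pairing rules above can be illustrated with a short sketch that checks the 5′-AGT-3′ / 5′-ACT-3′ example and scores partial complementarity. The helper names are hypothetical, the sketch handles only the four DNA bases, and it assumes two strands of equal length paired antiparallel with no gaps.

```python
# Watson-Crick partners for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the strand, written 5' to 3', that pairs with `seq` (also 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def complementarity_fraction(strand_a, strand_b):
    """Fraction of positions that base-pair when the strands are aligned antiparallel."""
    if len(strand_a) != len(strand_b) or not strand_a:
        raise ValueError("strands must be non-empty and of equal length")
    paired = sum(
        COMPLEMENT[a] == b for a, b in zip(strand_a, reversed(strand_b))
    )
    return paired / len(strand_a)


total = complementarity_fraction("AGT", "ACT")    # every base is matched
partial = complementarity_fraction("AGT", "ACA")  # one base is not matched
```

Here `total` corresponds to total complementarity and `partial` to partial complementarity in the sense defined above.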

The term “homology” when used in relation to nucleic acids refers to a degree of complementarity. There may be partial homology (i.e., partial identity) or complete homology (i.e., complete identity). A partially complementary sequence is one that at least partially inhibits a completely complementary sequence from hybridizing to a target nucleic acid and is referred to using the functional term “substantially homologous.” The inhibition of hybridization of the completely complementary sequence to the target sequence may be examined using a hybridization assay (Southern or Northern blot, solution hybridization and the like) under conditions of low stringency. A substantially homologous sequence or probe (i.e., an oligonucleotide which is capable of hybridizing to another oligonucleotide of interest) will compete for and inhibit the binding (i.e., the hybridization) of a completely homologous sequence to a target under conditions of low stringency. This is not to say that conditions of low stringency are such that non-specific binding is permitted; low stringency conditions require that the binding of two sequences to one another be a specific (i.e., selective) interaction. The absence of non-specific binding may be tested by the use of a second target which lacks even a partial degree of complementarity (e.g., less than about 30% identity); in the absence of non-specific binding the probe will not hybridize to the second non-complementary target.

When used in reference to a double-stranded nucleic acid sequence such as a cDNA or genomic clone, the term “substantially homologous” refers to any probe which can hybridize to either or both strands of the double-stranded nucleic acid sequence under conditions of low stringency. When used in reference to a single-stranded nucleic acid sequence, the term “substantially homologous” refers to any probe which can hybridize to the single-stranded nucleic acid sequence under conditions of low stringency.

The following terms are used to describe the sequence relationships between two or more polynucleotides: “reference sequence,” “sequence identity,” “percentage of sequence identity” and “substantial identity.” A “reference sequence” is a defined sequence used as a basis for a sequence comparison; a reference sequence may be a subset of a larger sequence, for example, as a segment of a full-length cDNA sequence given in a sequence listing or may comprise a complete gene sequence. Generally, a reference sequence is at least 20 nucleotides in length, frequently at least 25 nucleotides in length, and often at least 50 nucleotides in length. Since two polynucleotides may each (1) comprise a sequence (i.e., a portion of the complete polynucleotide sequence) that is similar between the two polynucleotides, and (2) may further comprise a sequence that is divergent between the two polynucleotides, sequence comparisons between two (or more) polynucleotides are typically performed by comparing sequences of the two polynucleotides over a “comparison window” to identify and compare local regions of sequence similarity. A “comparison window,” as used herein, refers to a conceptual segment of at least 20 contiguous nucleotide positions wherein a polynucleotide sequence may be compared to a reference sequence of at least 20 contiguous nucleotides and wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) of 20 percent or less as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. Optimal alignment of sequences for aligning a comparison window may be conducted by the local homology algorithm of Smith and Waterman (Smith and Waterman (1981) Adv. Appl. Math. 2:482), by the homology alignment algorithm of Needleman and Wunsch (J. Mol. Biol. 48:443 (1970)), by the search for similarity method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85:2444 (1988)), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package Release 7.0, Genetics Computer Group, 575 Science Dr., Madison, Wis.), or by inspection, and the best alignment (i.e., resulting in the highest percentage of homology over the comparison window) generated by the various methods is selected.

The term “sequence identity” means that two polynucleotide sequences are identical (i.e., on a nucleotide-by-nucleotide basis) over the window of comparison. The term “percentage of sequence identity” is calculated by comparing two optimally aligned sequences over the window of comparison, determining the number of positions at which the identical nucleic acid base (e.g., A, T, C, G, U, or I) occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison (i.e., the window size), and multiplying the result by 100 to yield the percentage of sequence identity. The term “substantial identity” as used herein denotes a characteristic of a polynucleotide sequence, wherein the polynucleotide comprises a sequence that has at least 85 percent sequence identity, preferably at least 90 to 95 percent sequence identity, more usually at least 99 percent sequence identity as compared to a reference sequence over a comparison window of at least 20 nucleotide positions, frequently over a window of at least 25-50 nucleotides, wherein the percentage of sequence identity is calculated by comparing the reference sequence to the polynucleotide sequence which may include deletions or additions which total 20 percent or less of the reference sequence over the window of comparison. The reference sequence may be a subset of a larger sequence, for example, as a segment of the full-length sequences of the compositions claimed in the present invention.
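The percentage calculation just described (matched positions divided by window size, times 100) can be sketched as follows. The sketch assumes the two sequences have already been optimally aligned by one of the methods cited earlier and are supplied as equal-length strings in which "-" marks a gap; the function name, the gap symbol and the example sequences are illustrative assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of sequence identity over an already aligned comparison window.

    Both inputs must have the same length, which is taken as the window size.
    A position counts as matched only when both sequences carry the same base;
    gap positions never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    window_size = len(aligned_a)
    if window_size == 0:
        raise ValueError("comparison window is empty")
    matched = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matched / window_size


# Hypothetical 20-position window with two mismatches (positions 8 and 20).
reference = "ACGTACGTACGTACGTACGT"
query = "ACGTACGAACGTACGTACGA"
identity = percent_identity(reference, query)
```

With 18 of 20 positions matched, this window would satisfy the "at least 85 percent" threshold given above for substantial identity.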

The term “hybridization” refers to the pairing of complementary nucleic acids. Hybridization and the strength of hybridization (i.e., the strength of the association between the nucleic acids) are impacted by such factors as the degree of complementarity between the nucleic acids, stringency of the conditions involved, the T_(m) of the formed hybrid, and the G:C ratio within the nucleic acids. A single molecule that contains pairing of complementary nucleic acids within its structure is said to be “self-hybridized.”

The term “T_(m)” refers to the melting temperature of a nucleic acid. The melting temperature is the temperature at which a population of double-stranded nucleic acid molecules becomes half dissociated into single strands. The equation for calculating the T_(m) of nucleic acids is well known in the art. As indicated by standard references, a simple estimate of the T_(m) value may be calculated by the equation: T_(m) = 81.5 + 0.41(% G+C), when a nucleic acid is in aqueous solution at 1 M NaCl (See, e.g., Anderson and Young, Quantitative Filter Hybridization, in Nucleic Acid Hybridization (1985)). Other references include more sophisticated computations that take structural as well as sequence characteristics into account for the calculation of T_(m).
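As a worked illustration of the simple estimate above, the following sketch evaluates T_(m) = 81.5 + 0.41(% G+C) for a given sequence. It implements only that formula, with the result in degrees Celsius for a nucleic acid in aqueous solution at 1 M NaCl as stated above, and applies no length, salt or mismatch corrections; the function name is a hypothetical choice.

```python
def estimate_tm(seq):
    """Simple melting temperature estimate: 81.5 + 0.41 * (%G+C).

    Only the base composition of `seq` is used; no correction for length,
    salt concentration or mismatches is applied.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    return 81.5 + 0.41 * gc_percent


# A sequence with 50% G+C gives an estimate of about 102 under this formula.
tm_half_gc = estimate_tm("ACGTACGT")
```

Each additional percentage point of G+C raises the estimate by 0.41, which reflects the contribution of G:C content to hybrid stability noted in the definition of hybridization.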

The term “stringency” refers to the conditions of temperature, ionic strength, and the presence of other compounds such as organic solvents, under which nucleic acid hybridizations are conducted.

“Low stringency conditions,” when used in reference to nucleic acid hybridization, comprise conditions equivalent to binding or hybridization at 42° C. in a solution consisting of 5×SSPE (43.8 g/l NaCl, 6.9 g/l NaH₂PO₄(H₂O) and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.1% SDS, 5×Denhardt's reagent (50×Denhardt's contains per 500 ml: 5 g Ficoll (Type 400, Pharmacia), 5 g BSA (Fraction V; Sigma)) and 100 μg/ml denatured salmon sperm DNA followed by washing in a solution comprising 5×SSPE, 0.1% SDS at 42° C. when a probe of about 500 nucleotides in length is employed.

“Medium stringency conditions,” when used in reference to nucleic acid hybridization, comprise conditions equivalent to binding or hybridization at 42° C. in a solution consisting of 5×SSPE (43.8 g/l NaCl, 6.9 g/l NaH₂PO₄(H₂O) and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.5% SDS, 5×Denhardt's reagent and 100 μg/ml denatured salmon sperm DNA followed by washing in a solution comprising 1.0×SSPE, 1.0% SDS at 42° C. when a probe of about 500 nucleotides in length is employed.

“High stringency conditions,” when used in reference to nucleic acid hybridization, comprise conditions equivalent to binding or hybridization at 42° C. in a solution consisting of 5×SSPE (43.8 g/l NaCl, 6.9 g/l NaH₂PO₄(H₂O) and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.5% SDS, 5×Denhardt's reagent and 100 μg/ml denatured salmon sperm DNA followed by washing in a solution comprising 0.1×SSPE, 1.0% SDS at 42° C. when a probe of about 500 nucleotides in length is employed.

It is well known that numerous equivalent conditions may be employed to comprise low stringency conditions; factors such as the length and nature (DNA, RNA, base composition) of the probe and nature of the target molecule (DNA, RNA, base composition, present in solution or immobilized, etc.) and the concentration of the salts and other components (e.g., the presence or absence of formamide, dextran sulfate, polyethylene glycol) are considered and the hybridization solution may be varied to generate conditions of low stringency hybridization different from, but equivalent to, the above listed conditions. In addition, the art knows conditions that promote hybridization under conditions of high stringency (e.g., increasing the temperature of the hybridization and/or wash steps, the use of formamide in the hybridization solution, etc.).

In certain exemplary embodiments, and with reference to FIG. 1, cells are identified and then a single cell or a plurality of cells are isolated. Cells within the scope of the present disclosure include any type of cell where understanding the RNA content is considered by those of skill in the art to be useful. A cell according to the present disclosure includes a hepatocyte, oocyte, embryo, stem cell, iPS cell, ES cell, neuron, erythrocyte, melanocyte, astrocyte, germ cell, oligodendrocyte, kidney cell, leukocyte, thrombocyte, epithelial cell, adipocyte or fibroblast and the like. According to one aspect, the methods of the present invention are practiced with the cellular RNA from a single cell. However, according to certain aspects, the cellular RNA content from a plurality of cells may be used. A plurality of cells includes from about 2 to about 1,000,000 cells, about 2 to about 10 cells, about 2 to about 100 cells, about 2 to about 1,000 cells, about 2 to about 10,000 cells, or about 2 to about 100,000 cells.

As used herein, a “single cell” refers to one cell. Single cells useful in the methods described herein can be obtained from a tissue of interest, or from a biopsy, blood sample, or cell culture. Additionally, cells from specific organs, tissues, tumors, neoplasms, or the like can be obtained and used in the methods described herein. Furthermore, in general, cells from any population can be used in the methods, such as a population of prokaryotic or eukaryotic single celled organisms including bacteria or yeast. In some aspects of the invention, the method of preparing a collection of cDNA, i.e. a cDNA library, can include the step of obtaining single cells. A single cell suspension can be obtained using standard methods known in the art including, for example, enzymatically using trypsin or papain to digest proteins connecting cells in tissue samples or releasing adherent cells in culture, or mechanically separating cells in a sample. Single cells can be placed in any suitable reaction vessel in which single cells can be treated individually, for example a 96-well plate, such that each single cell is placed in a single well.

Methods for manipulating single cells are known in the art and include fluorescence activated cell sorting (FACS), micromanipulation and the use of semi-automated cell pickers (e.g. the Quixell™ cell transfer system from Stoelting Co.). Individual cells can, for example, be individually selected based on features detectable by microscopic observation, such as location, morphology, or reporter gene expression.

Aspects of the present disclosure include methods for identifying the RNA content and copy number for certain RNA sequences in a cell. Using the methods disclosed herein, the RNA content of cells can be compared at various stages or times and the absolute amount of RNA or the relative amount of RNA can be used to diagnose or treat certain diseases, disorders or conditions. Diseases, disorders or conditions within the scope of the present disclosure include cancer, Alzheimer's disease, Parkinson's disease, hepatitis, muscular dystrophy, psoriasis, tuberculosis, lysosome disease, ulcerative colitis, Ehlers-Danlos Syndrome, cystic fibrosis, diabetes, hemophilia, sickle cell anemia, HIV, autoimmune diseases, Huntington's disease, ALS, and Shy-Drager syndrome.

Once a desired cell has been identified, the cell is lysed to release cellular contents including DNA and RNA, such as mRNA, using methods known to those of skill in the art. The cellular contents are contained within a vessel. In some aspects of the invention, cellular contents, such as mRNA, can be released from the cells by lysing the cells. Lysis can be achieved by, for example, heating the cells, or by the use of detergents or other chemical methods, or by a combination of these. However, any suitable lysis method known in the art can be used. A mild lysis procedure can advantageously be used to prevent the release of nuclear chromatin, thereby avoiding genomic contamination of the cDNA library, and to minimize degradation of mRNA. For example, heating the cells at 72° C. for 2 minutes in the presence of Tween-20 is sufficient to lyse the cells while resulting in no detectable genomic contamination from nuclear chromatin. Alternatively, cells can be heated to 65° C. for 10 minutes in water (Esumi et al., Neurosci Res 60(4):439-51 (2008)); or 70° C. for 90 seconds in PCR buffer II (Applied Biosystems) supplemented with 0.5% NP-40 (Kurimoto et al., Nucleic Acids Res 34(5):e42 (2006)); or lysis can be achieved with a protease such as Proteinase K or by the use of chaotropic salts such as guanidine isothiocyanate (U.S. Publication No. 2007/0281313).

Nucleic acids from a cell such as DNA or RNA are isolated using methods known to those of skill in the art. Such methods include removing or otherwise separating genomic DNA and other cellular constituents from RNA. Methods of removing or separating genomic DNA from RNA include the use of DNase I.

Synthesis of cDNA from mRNA in the methods described herein can be performed directly on cell lysates, such that a reaction mix for reverse transcription is added directly to cell lysates. Alternatively, mRNA can be purified after its release from cells. This can help to reduce mitochondrial and ribosomal contamination. mRNA purification can be achieved by any method known in the art, for example, by binding the mRNA to a solid phase. Commonly used purification methods include paramagnetic beads (e.g. Dynabeads). Alternatively, specific contaminants, such as ribosomal RNA, can be selectively removed using affinity purification.

The nucleic acids, such as DNA or RNA, are then combined in a vessel with primers. The primers include unique barcode sequences and the primers attach to the nucleic acid molecules. According to one aspect, nucleic acid molecules within the vessel each have a unique barcode sequence. After primers have attached to the nucleic acid molecules, excess primers are removed from the vessel.
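How long a barcode must be for essentially every molecule in the vessel to receive a unique sequence can be estimated with a Poisson argument of the kind referred to in the description of FIG. 8. The sketch below is an illustration under stated assumptions, not the calculation used to produce that figure: it assumes each molecule draws its barcode uniformly at random from all 4^L sequences of length L, so that the number of other molecules sharing a given barcode is approximately Poisson with mean (N-1)/4^L. The function name and parameters are hypothetical.

```python
import math


def fraction_uniquely_barcoded(n_molecules, barcode_length, alphabet_size=4):
    """Approximate fraction of molecules whose barcode is shared with no other molecule.

    Assumes uniform random barcode assignment. For one molecule, the number of
    other molecules carrying the same barcode is roughly Poisson with mean
    (N - 1) / M, where M = alphabet_size ** barcode_length, so the probability
    of zero collisions is exp(-(N - 1) / M).
    """
    if n_molecules < 1 or barcode_length < 1:
        raise ValueError("n_molecules and barcode_length must be at least 1")
    n_barcodes = alphabet_size ** barcode_length
    return math.exp(-(n_molecules - 1) / n_barcodes)


# Tabulate the estimate for the four sample sizes named for FIG. 8.
table = {
    n: [fraction_uniquely_barcoded(n, length) for length in (10, 15, 20)]
    for n in (10**5, 10**6, 10**7, 10**8)
}
```

Under this approximation the uniquely barcoded fraction rises steeply with barcode length, because each added base multiplies the number of available barcodes by four.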

According to one aspect, unique barcode sequences may be attached to each nucleic acid, i.e. DNA or RNA in a sample. Then adapters and/or primers or other reagents known to those of skill in the art may be used as desired to reverse transcribe or amplify the nucleic acid with the unique barcode sequence, as the case may be.

cDNA is typically synthesized from mRNA by reverse transcription. Methods for synthesizing cDNA from small amounts of mRNA, including from single cells, have previously been described (see Kurimoto et al., Nucleic Acids Res 34(5):e42 (2006); Kurimoto et al., Nat Protoc 2(3):739-52 (2007); and Esumi et al., Neurosci Res 60(4):439-51 (2008)). In order to generate an amplifiable cDNA, these methods introduce a primer annealing sequence at both ends of each cDNA molecule in such a way that the cDNA library can be amplified using a single primer. The Kurimoto method uses a polymerase to add a 3′ poly-A tail to the cDNA strand, which can then be amplified using a universal oligo-T primer. In contrast, the Esumi method uses a template switching method to introduce an arbitrary sequence at the 3′ end of the cDNA, which is designed to be reverse complementary to the 3′ tail of the cDNA synthesis primer. Again, the cDNA library can be amplified by a single PCR primer. Single-primer PCR exploits the PCR suppression effect to reduce the amplification of short contaminating amplicons and primer-dimers (Dai et al., J Biotechnol 128(3):435-43 (2007)). As the two ends of each amplicon are complementary, short amplicons will form stable hairpins, which are poor templates for PCR. This reduces the amount of truncated cDNA and improves the yield of longer cDNA molecules.

In some aspects of the invention, the synthesis of the first strand of the cDNA can be directed by a cDNA synthesis primer (CDS) that includes an RNA complementary sequence (RCS). In some aspects of the invention, the RCS is at least partially complementary to one or more mRNA in an individual mRNA sample. This allows the primer, which is typically an oligonucleotide, to hybridize to at least some mRNA in an individual mRNA sample to direct cDNA synthesis using the mRNA as a template. The RCS can comprise oligo (dT), or be gene family-specific, such as a sequence of nucleic acids present in all or a majority of related genes, or can be composed of a random sequence, such as random hexamers. To avoid the CDS priming on itself and thus generating undesired side products, a non-self-complementary semi-random sequence can be used. For example, one letter of the genetic code can be excluded, or a more complex design can be used while restricting the CDS to be non-self-complementary.
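One possible reading of a non-self-complementary semi-random sequence is sketched below. The approach is an assumption made for illustration and is not a design taken from the disclosure: candidates are drawn at random from a three-letter alphabet (one base excluded), and any candidate that contains a stretch of k bases together with that stretch's reverse complement is rejected, since such a sequence could anneal to itself or to another copy of itself. The function names, the choice of excluded base and the value of k are all hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def has_self_complementarity(seq, k=4):
    """True if some k-mer of `seq` and its reverse complement both occur in `seq`."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return any(reverse_complement(kmer) in kmers for kmer in kmers)


def semi_random_sequence(length, excluded="T", k=4, rng=None, max_tries=10_000):
    """Draw a random sequence that omits one base and passes the k-mer check."""
    rng = rng or random.Random()
    alphabet = [base for base in "ACGT" if base != excluded]
    for _ in range(max_tries):
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        if not has_self_complementarity(candidate, k):
            return candidate
    raise RuntimeError("no acceptable sequence found; relax k or max_tries")


# Seeded generator so that the illustration is repeatable.
example_rcs = semi_random_sequence(6, excluded="T", rng=random.Random(0))
```

Excluding one base removes every pairing that involves its partner (with T excluded, A has nothing to pair with), while the k-mer filter catches the remaining G/C-only palindromes such as GCGC.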

The RCS can also be at least partially complementary to a portion of the first strand of cDNA, such that it is able to direct the synthesis of a second strand of cDNA using the first strand of the cDNA as a template. Thus, following first strand synthesis, an RNase enzyme (e.g. an enzyme having RNaseH activity) can be added to degrade the RNA strand and to permit the CDS to anneal again on the first strand to direct the synthesis of a second strand of cDNA. For example, the RCS could comprise random hexamers, or a non-self-complementary semi-random sequence (which minimizes self-annealing of the CDS).

A template-switching oligonucleotide (TSO) that includes a portion which is at least partially complementary to a portion of the 3′ end of the first strand of cDNA can be added to each individual mRNA sample in the methods described herein. Such a template switching method is described in Esumi et al., Neurosci Res 60(4):439-51 (2008), and allows full length cDNA comprising the complete 5′ end of the mRNA to be synthesized. As the terminal transferase activity of reverse transcriptase typically causes 2-5 cytosines to be incorporated at the 3′ end of the first strand of cDNA synthesized from mRNA, the first strand of cDNA can include a plurality of cytosines, or cytosine analogues that base pair with guanosine, at its 3′ end (see U.S. Pat. No. 5,962,272). In one aspect of the invention, the first strand of cDNA can include a 3′ portion comprising at least 2, at least 3, at least 4, at least 5 or 2, 3, 4, or 5 cytosines or cytosine analogues that base pair with guanosine. A non-limiting example of a cytosine analogue that base pairs with guanosine is 5-aminoallyl-2′-deoxycytidine.

In one aspect of the invention, the TSO can include a 3′ portion comprising a plurality of guanosines or guanosine analogues that base pair with cytosine. Examples of guanosines or guanosine analogues useful in the methods described herein include, but are not limited to, deoxyriboguanosine, riboguanosine, locked nucleic acid-guanosine, and peptide nucleic acid-guanosine. The guanosines can be ribonucleosides or locked nucleic acid monomers.

A locked nucleic acid (LNA) is a modified RNA nucleotide. The ribose moiety of an LNA nucleotide is modified with an extra bridge connecting the 2′ oxygen and 4′ carbon. The bridge “locks” the ribose in the 3′-endo (North) conformation. Some of the advantages of using LNAs in the methods of the invention include increased thermal stability of duplexes, increased target specificity and resistance to exo- and endonucleases.

A peptide nucleic acid (PNA) is an artificially synthesized polymer similar to DNA or RNA, wherein the backbone is composed of repeating N-(2-aminoethyl)-glycine units linked by peptide bonds. The backbone of a PNA is substantially non-ionic under neutral conditions, in contrast to the highly charged phosphodiester backbone of naturally occurring nucleic acids. This provides two non-limiting advantages. First, the PNA backbone exhibits improved hybridization kinetics. Secondly, PNAs have larger changes in the melting temperature (Tm) for mismatched versus perfectly matched basepairs. DNA and RNA typically exhibit a 2-4° C. drop in Tm for an internal mismatch. With the non-ionic PNA backbone, the drop is closer to 7-9° C. This can provide for better sequence discrimination. Similarly, due to their non-ionic nature, hybridization of the bases attached to these backbones is relatively insensitive to salt concentration.

A nucleic acid useful in the invention can contain a non-natural sugar moiety in the backbone. Exemplary sugar modifications include but are not limited to 2′ modifications such as addition of halogen, alkyl, substituted alkyl, SH, SCH₃, OCN, Cl, Br, CN, CF₃, OCF₃, SO₂CH₃, OSO₂, SO₃, CH₃, ONO₂, NO₂, N₃, NH₂, substituted silyl, and the like. Similar modifications can also be made at other positions on the sugar, particularly the 3′ position of the sugar on the 3′ terminal nucleotide or in 2′-5′ linked oligonucleotides and the 5′ position of the 5′ terminal nucleotide. Nucleic acids, nucleoside analogs or nucleotide analogs having sugar modifications can be further modified to include a reversible blocking group, peptide linked label or both. In those embodiments where the above-described 2′ modifications are present, the base can have a peptide linked label.

A nucleic acid used in the invention can also include native or non-native bases. In this regard a native deoxyribonucleic acid can have one or more bases selected from the group consisting of adenine, thymine, cytosine or guanine and a ribonucleic acid can have one or more bases selected from the group consisting of uracil, adenine, cytosine or guanine. Exemplary non-native bases that can be included in a nucleic acid, whether having a native backbone or analog structure, include, without limitation, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, 5-methylcytosine, 5-hydroxymethyl cytosine, 2-aminoadenine, 6-methyl adenine, 6-methyl guanine, 2-propyl guanine, 2-propyl adenine, 2-thiouracil, 2-thiothymine, 2-thiocytosine, 5-halouracil, 5-halocytosine, 5-propynyl uracil, 5-propynyl cytosine, 6-azo uracil, 6-azo cytosine, 6-azo thymine, 5-uracil, 4-thiouracil, 8-halo adenine or guanine, 8-amino adenine or guanine, 8-thiol adenine or guanine, 8-thioalkyl adenine or guanine, 8-hydroxyl adenine or guanine, 5-halo substituted uracil or cytosine, 7-methylguanine, 7-methyladenine, 8-azaguanine, 8-azaadenine, 7-deazaguanine, 7-deazaadenine, 3-deazaguanine, 3-deazaadenine or the like. A particular embodiment can utilize isocytosine and isoguanine in a nucleic acid in order to reduce non-specific hybridization, as generally described in U.S. Pat. No. 5,681,702.

A non-native base used in a nucleic acid of the invention can have universal base pairing activity, wherein it is capable of base pairing with any other naturally occurring base. Exemplary bases having universal base pairing activity include 3-nitropyrrole and 5-nitroindole. Other bases that can be used include those that have base pairing activity with a subset of the naturally occurring bases such as inosine, which base pairs with cytosine, adenine or uracil.

In one aspect of the invention, the TSO can include a 3′ portion including at least 2, at least 3, at least 4, at least 5, or 2, 3, 4, or 5, or 2-5 guanosines, or guanosine analogues that base pair with cytosine. The presence of a plurality of guanosines (or guanosine analogues that base pair with cytosine) allows the TSO to anneal transiently to the exposed cytosines at the 3′ end of the first strand of cDNA. This causes the reverse transcriptase to switch template and continue to synthesize a strand complementary to the TSO. In one aspect of the invention, the 3′ end of the TSO can be blocked, for example by a 3′ phosphate group, to prevent the TSO from functioning as a primer during cDNA synthesis.

In one aspect of the invention, the mRNA is released from the cells by cell lysis. If the lysis is achieved partially by heating, then the CDS and/or the TSO can be added to each individual mRNA sample during cell lysis, as this will aid hybridization of the oligonucleotides. In some aspects, reverse transcriptase can be added after cell lysis to avoid denaturation of the enzyme.

In some aspects of the invention, a tag can be incorporated into the cDNA during its synthesis. For example, the CDS and/or the TSO can include a tag, such as a particular nucleotide sequence, which can be at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15 or at least 20 nucleotides in length. For example, the tag can be a nucleotide sequence of 4-20 nucleotides in length, e.g. 4, 5, 6, 7, 8, 9, 10, 15 or 20 nucleotides in length. As the tag is present in the CDS and/or the TSO it will be incorporated into the cDNA during its synthesis and can therefore act as a “barcode” to identify the cDNA. Both the CDS and the TSO can include a tag. The CDS and the TSO can each include a different tag such that the tagged cDNA sample comprises a combination of tags. Each cDNA sample generated by the above method can have a distinct tag, or a distinct combination of tags, such that once the tagged cDNA samples have been pooled, the tag can be used to identify from which single cell each cDNA sample originated. Thus, each cDNA sample can be linked to a single cell, even after the tagged cDNA samples have been pooled in the methods described herein.
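The tag-based assignment of pooled cDNA to single cells described above can be sketched in Python as follows. This is a minimal illustration only: it assumes that each sequencing read begins with the tag, that tags have a fixed length, and that a lookup table of tags to cells is available. The names `demultiplex` and `tag_to_cell` are hypothetical and do not come from the disclosure.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_cell, tag_length):
    """Group pooled reads by the cell whose tag starts each read.

    Assumption of this sketch: the tag occupies the first `tag_length`
    bases of the read. Reads with an unknown tag are kept under the key
    "unassigned" rather than discarded.
    """
    by_cell = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        cell = tag_to_cell.get(tag)
        if cell is None:
            by_cell["unassigned"].append(read)
        else:
            by_cell[cell].append(insert)
    return dict(by_cell)


# Hypothetical example data: two cells, each with a distinct 4-base tag.
example_tags = {"ACGT": "cell_1", "TGCA": "cell_2"}
example_reads = ["ACGTTTTT", "TGCAGGGG", "ACGTCCCC", "GGGGAAAA"]
example_result = demultiplex(example_reads, example_tags, tag_length=4)
```

Where both the CDS and the TSO carry a tag, the same idea would apply with the combination of the two tags used as the lookup key.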

Before the tagged cDNA samples are pooled, synthesis of cDNA can be stopped, for example by removing or inactivating the reverse transcriptase. This prevents cDNA synthesis by reverse transcription from continuing in the pooled samples. The tagged cDNA samples can optionally be purified before amplification, either before or after they are pooled.

As used herein, the term “barcode” refers to a unique oligonucleotide sequence that allows a corresponding nucleic acid base and/or nucleic acid sequence to be identified. In certain aspects, the nucleic acid base and/or nucleic acid sequence is located at a specific position on a larger polynucleotide sequence (e.g., a polynucleotide covalently attached to a bead). In certain embodiments, barcodes can each have a length within a range of from 4 to 36 nucleotides, or from 6 to 30 nucleotides, or from 8 to 20 nucleotides. In certain aspects, the melting temperatures of barcodes within a set are within 10° C. of one another, within 5° C. of one another, or within 2° C. of one another. In other aspects, barcodes are members of a minimally cross-hybridizing set. That is, the nucleotide sequence of each member of such a set is sufficiently different from that of every other member of the set that no member can form a stable duplex with the complement of any other member under stringent hybridization conditions. In one aspect, the nucleotide sequence of each member of a minimally cross-hybridizing set differs from those of every other member by at least two nucleotides. Barcode technologies are known in the art and are described in Winzeler et al. (1999) Science 285:901; Brenner (2000) Genome Biol. 1:1; Kumar et al. (2001) Nature Rev. 2:302; Giaever et al. (2004) Proc. Natl. Acad. Sci. USA 101:793; Eason et al. (2004) Proc. Natl. Acad. Sci. USA 101:11046; and Brenner (2004) Genome Biol. 5:240, each incorporated by reference in its entirety.
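The requirement that each member of a set differ from every other member by at least two nucleotides can be checked with a short Python sketch. It assumes equal-length barcodes and uses the Hamming distance (the count of mismatched positions) as the measure of difference. The function names are illustrative. Passing this check does not by itself show that a set is minimally cross-hybridizing under stringent conditions; it only tests the stated two-nucleotide criterion.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance over all pairs in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def differs_by_at_least(barcodes, min_distance=2):
    """True if every pair of barcodes differs at >= min_distance positions."""
    return min_pairwise_distance(barcodes) >= min_distance


# Hypothetical 4-base barcode set used only for illustration.
example_set = ["AAAA", "AACC", "CCAA", "CCCC"]
```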

As used herein, the term “primer” includes an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers usually have a length in the range of from 3 to 36 nucleotides, also 5 to 24 nucleotides, also from 14 to 36 nucleotides. Primers within the scope of the invention include orthogonal primers, amplification primers, construction primers and the like. Pairs of primers can flank a sequence of interest or a set of sequences of interest. Primers and probes can be degenerate in sequence. Primers within the scope of the present invention bind adjacent to a target sequence. A “primer” may be considered a short polynucleotide, generally with a free 3′-OH group, that binds to a target or template potentially present in a sample of interest by hybridizing with the target, and thereafter promoting polymerization of a polynucleotide complementary to the target. Primers of the instant invention are comprised of nucleotides ranging from 17 to 30 nucleotides. In one aspect, the primer is at least 17 nucleotides, or alternatively, at least 18 nucleotides, or alternatively, at least 19 nucleotides, or alternatively, at least 20 nucleotides, or alternatively, at least 21 nucleotides, or alternatively, at least 22 nucleotides, or alternatively, at least 23 nucleotides, or alternatively, at least 24 nucleotides, or alternatively, at least 25 nucleotides, or alternatively, at least 26 nucleotides, or alternatively, at least 27 nucleotides, or alternatively, at least 28 nucleotides, or alternatively, at least 29 nucleotides, or alternatively, at least 30 nucleotides, or alternatively at least 50 nucleotides, or alternatively at least 75 nucleotides or alternatively at least 100 nucleotides.

According to one aspect, a sufficient number of barcode sequences is required to statistically ensure that each DNA or RNA in a sample has its own unique barcode. For example, a barcode having six bases would generate 4⁶ or 4096 unique barcode sequences. According to one aspect, the number of RNA molecules is dependent upon cell type and one of skill in the art will reference readily available literature sources for estimates of the number of RNA molecules in a particular cell type. Accordingly, a statistically greater number of barcodes, or primer-barcode combinations or adapter-barcode combinations is required to facilitate the matching of a unique barcode for each RNA molecule. In embodiments where copy number barcoding is applied to single cell gene expression profiling, a sufficient number or set of unique barcodes for unique labeling of a substantial fraction of RNA molecules in the sample is provided. Examples of single cell RNA samples include single E. coli cells which contain ˜260,000 RNA molecules (Neidhardt et al., Escherichia coli and Salmonella: Cell. Mol. Bio., Vol. 1, pp. 13-16, 1996), embryonic stem cells which contain ˜10 pg mRNA/cell (˜10⁷ molecules) (Tang et al., Nat. Protocols, 5, 517, 2010), and neurons which contain ˜50 pg mRNA/cell (5×10⁷ molecules) (Uemura et al., Experimental Neurobiology, 65, 107, 1979).
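The size of the barcode space for a given length can be worked through in a few lines of Python. The sketch below computes 4^L and the shortest length at which the number of possible sequences first reaches the number of molecules, using the molecule counts quoted above. The function names are illustrative. This shortest length is only a lower bound: when barcodes are drawn at random, a considerably larger space is needed before most molecules receive a barcode that no other molecule shares.

```python
def barcode_space(length):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return 4 ** length


def min_length_to_cover(n_molecules):
    """Shortest barcode length L with 4**L >= n_molecules (a lower bound)."""
    length = 1
    while barcode_space(length) < n_molecules:
        length += 1
    return length


# Molecule counts quoted in the text for three single cell sample types.
example_counts = {
    "E. coli cell": 260_000,
    "embryonic stem cell": 10 ** 7,
    "neuron": 5 * 10 ** 7,
}
example_lower_bounds = {
    name: min_length_to_cover(n) for name, n in example_counts.items()
}
```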

The barcoding process can be modeled analytically to calculate the barcode length required to uniquely barcode a substantial fraction of a sample of target nucleic acid molecules. A substantial fraction or percentage includes 80%, 90%, 95%, 96%, 97%, 98%, 99%, 99.5%, 99.9% and higher. According to one aspect, a pool of random barcode sequences is provided. Consider a stock pool of barcodes with length L containing multiple copies of the 4^L barcodes. According to one embodiment, in order to barcode a sample of N target nucleic acid molecules, a sample containing M barcode molecules of random sequence is taken from a large barcode pool and added to the sample of N target nucleic acid molecules such that M>N. Because the identity of the M barcode molecules is random, the copy number distribution of the barcode sequences added to the sample is given by the Poisson distribution P(k) = (M/4^L)^k e^(−M/4^L)/k!, where k is the copy number. A random subset of these M barcode molecules will then label the N target nucleic acid molecules, a process which is also described by the Poisson distribution. Hence, the distribution of target nucleic acid molecules for which j−1 other target nucleic acid molecules share the same barcode sequence is given by P(j) = (N/4^L)^j e^(−N/4^L)/j!. FIG. 8 shows a plot of the calculated fraction of molecules that are uniquely barcoded as a function of barcode length for four different sample sizes (N=10⁵, 10⁶, 10⁷, 10⁸). Even for large samples containing 10⁸ target nucleic acid molecules (similar to the case in which the RNA from a human cell is to be barcoded), a barcode length of L=18 provides a unique barcode sequence for >99.9% of target molecules. According to one embodiment, only target molecules with identical sequences need to be uniquely barcoded in order to count the number of target molecules present, as target molecules having a unique sequence can be identified based on their unique nucleic acid sequence and do not necessarily need a unique barcode sequence.
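The Poisson model above can be explored numerically with a short Python sketch. It evaluates the Poisson expression with mean N/4^L and, as a simplifying assumption of this sketch, approximates the fraction of uniquely barcoded molecules by e^(−N/4^L), i.e. the probability that no other target molecule carries the same barcode sequence. This approximation is one reading of the model and is not necessarily the exact computation behind FIG. 8. The function names are illustrative.

```python
import math


def poisson_term(j, n_targets, barcode_length):
    """Poisson probability with mean N / 4**L evaluated at count j."""
    lam = n_targets / 4 ** barcode_length
    return lam ** j * math.exp(-lam) / math.factorial(j)


def fraction_uniquely_barcoded(n_targets, barcode_length):
    """Approximate fraction of targets whose barcode is shared by no other.

    Assumption of this sketch: the number of other targets carrying the
    same barcode is Poisson with mean N / 4**L, so the fraction with no
    sharing is exp(-N / 4**L).
    """
    return math.exp(-n_targets / 4 ** barcode_length)


def unique_fraction_by_length(n_targets, lengths):
    """Tabulate the approximate unique fraction for several lengths."""
    return {L: fraction_uniquely_barcoded(n_targets, L) for L in lengths}


# Sample sizes follow the four cases named in the text.
example_table = {
    n: unique_fraction_by_length(n, range(6, 25, 2))
    for n in (10 ** 5, 10 ** 6, 10 ** 7, 10 ** 8)
}
```

Tabulating the result over a range of lengths shows how quickly the unique fraction approaches 1 once 4^L exceeds N by a few orders of magnitude.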

Barcodes within the scope of the present disclosure include nucleic acids of between about 3 and about 75 bases, about 5 and about 60 bases, about 6 and about 50 bases, about 10 and about 40 bases, or about 15 and about 30 bases. Barcodes can be comprised of any nucleic acid including DNA, RNA, and LNA. Random or pseudo-random barcode sequences can be synthesized using automated DNA synthesis technology that is known in the art (Horvath et al., Methods Enzymol., 153, 314-326, 1987). Several commercially available services exist for the synthesis of randomly or pseudo-randomly barcoded nucleic acid oligonucleotides including Integrated DNA Technologies, Invitrogen, and TriLink Biotechnologies. In addition, there are enzymatic methods of synthesizing single-stranded nucleic acids. For example, non-template-directed enzymatic synthesis of nucleic acids can be carried out with poly(U) polymerase (for RNA) or terminal transferase (for DNA). For non-random barcode sets, methods of highly parallel DNA synthesis of specific, pre-determined DNA oligos are also known in the art such as the maskless array synthesis technology commercialized by Nimblegen. All of these methods are capable of adding a random, pseudo-random, or non-random barcode to a pre-determined sequence such as a primer or adapter.

According to one aspect, a copy number barcode is 5′-conjugated to a primer such as a poly-T primer. The barcode of the present disclosure is referred to as a copy number barcode because the use of unique barcode sequences allows one of skill to determine the total number of nucleic acids within the sample and also the copy number of nucleic acids within a sample. A barcode sequence can be conjugated to a primer using methods known to those of skill in the art. Such methods include the combination of automated, random and directed nucleic acid synthesis where the random component generates a barcode and the directed component generates a specific primer (Horvath, S. J., Firca, J. R., Hunkapiller, T., Hunkapiller, M. W., Hood, L. An automated DNA synthesizer employing deoxynucleoside 3′-phosphoramidites. Methods Enzymol., 153, 314-326, 1987), single-stranded ligation (Tessier, D. C., Brousseau, R., Vernet, T. Ligation of single-stranded oligodeoxyribonucleotides by T4 RNA ligase, Anal. Biochem., 158, 171-178, 1986), and double-stranded ligation (Meyer, M., Stenzel, U., Myles, S., Prufer, K., Hofreiter, M., Targeted high-throughput sequencing of tagged nucleic acid samples. Nucl. Acids Res., 35, e97, 2007).
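How unique barcodes yield both a total molecule count and a per-sequence copy number can be sketched in Python. The sketch assumes that each read has already been split into a barcode and a target sequence identifier, and that reads sharing both are copies of one original molecule. The input format and the function name `count_molecules` are hypothetical.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count original molecules from (barcode, sequence) read pairs.

    Reads with the same barcode and the same sequence are collapsed into
    one molecule, so the copy number of a sequence is the number of
    distinct barcodes observed with it.
    """
    barcodes_by_sequence = defaultdict(set)
    for barcode, sequence in reads:
        barcodes_by_sequence[sequence].add(barcode)
    copy_number = {seq: len(bcs) for seq, bcs in barcodes_by_sequence.items()}
    total = sum(copy_number.values())
    return copy_number, total


# Hypothetical reads: geneA was seen with two barcodes (one read twice).
example_reads = [
    ("AACGTT", "geneA"),
    ("AACGTT", "geneA"),
    ("GGTACC", "geneA"),
    ("TTCAGG", "geneB"),
]
example_copy_number, example_total = count_molecules(example_reads)
```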

According to one aspect, a copy number barcode is conjugated to a hairpin primer. A variety of commercially available enzymes can be used to ligate DNA to RNA including CircLigase (Epicentre), CircLigase II (Epicentre), T4 RNA Ligase 1 (New England Biolabs), and T4 RNA Ligase 2 (New England Biolabs), allowing the ligation of specific DNA hairpin primers, as shown in FIG. 2, to the 3′ end of RNA for reverse transcription. In certain exemplary embodiments, DNA hairpin primers include a self-complementary region that folds into double-stranded DNA and serves as a DNA primer for reverse transcriptase. The DNA hairpin can also optionally include other features such as recognition sequences for enzymes that digest double-stranded DNA, copy number barcodes, sample barcodes, PCR adapters and the like. Because reverse transcriptase can replicate both DNA and RNA, the hairpin can include a single-stranded DNA 5′ overhang with any additional sequence content.

As shown in FIG. 2-1, a hairpin primer includes a hairpin for priming reverse transcription on its 3′ end along with a copy number barcode and a sample barcode on its 5′ end. However, a hairpin primer could also include PCR adapters on the 5′ end as shown in FIG. 2-2. PCR adapters should be compatible with the library preparation protocols employed in an appropriate massively parallel sequencing platform. The inclusion of two PCR adapters eliminates downstream ligation steps that are otherwise typically needed for amplification and sequencing library preparation. Following RNA/DNA ligation and reverse transcription, the resultant cDNA can be circularized using, e.g., CircLigase (Epicentre), CircLigase II (Epicentre) or the like, allowing exponential amplification of the cDNA library via PCR. Alternatively, if circularization is undesirable, the hairpin primer can be designed to include only one of two PCR adapters. A subsequent ligation step can then be included to attach a second PCR adapter to the 3′ end of cDNA.

As shown in FIG. 1, a linear pre-amplification step is provided where multiple copies of cDNA are made by repeated reverse transcription of the RNA with the primer-barcode conjugate. This method is useful when very low concentrations of the nucleic acids of interest are available (e.g. in single cell or diagnostic applications). Inefficiencies due to material loss can be mitigated without introducing bias by incorporating the linear pre-amplification step into the reverse transcription process.

As shown in FIG. 2-3, by including a recognition sequence for a nicking enzyme in the hairpin primer, multiple copies of cDNA can be generated by the inclusion of the corresponding nicking enzyme in the reverse transcriptase reaction mixture. Before reverse transcribing the ligated RNA into cDNA, reverse transcriptase will convert the single-stranded DNA recognition sequence into double-stranded DNA, allowing the nicking enzyme to generate a nick at the recognition site as shown in FIG. 3. Reverse transcriptase can then use the nick as a priming site for additional replication because its strand-displacement activity allows the removal of the most recently generated cDNA copy from the RNA. Repeated cycles of nicking and reverse transcription result in linear amplification of RNA such that multiple copies of each barcoded cDNA are generated. The resultant cDNA library can be circularized and amplified using PCR. The same method of linear pre-amplification can be applied to a sample of uniquely barcoded DNA molecules. Instead of repeatedly reverse transcribing the barcoded DNA, repeated cycles of nicking and DNA replication by a strand-displacing DNA polymerase will result in linear amplification of DNA such that multiple copies of barcoded DNA are generated. Strand-displacing DNA polymerases are known in the art and include φ29 DNA polymerase, Klenow Fragment DNA polymerase, Bst Large Fragment DNA polymerase, Vent DNA polymerase, Deep Vent DNA polymerase, and 9°N DNA polymerase.

Because an overwhelming amount of hairpin primer is included in certain of the ligation reactions described herein, it may be important to remove excess hairpin primer prior to PCR amplification. A circularization reaction performed prior to PCR results not only in circularized cDNA, but also in circularization of excess, linearly amplified 5′ ends of the hairpin primer that were not ligated to RNA. The excess 5′ ends that are circularized could cause premature saturation of PCR because, although they are not attached to cDNA, they do include the two PCR adapters. The 5′ ends of the hairpin primers can be designed to form a restriction or cut site following circularization to avoid exponential amplification during PCR and facilitate their removal by exonuclease-mediated digestion. For example, FIG. 2-4 shows a hairpin primer containing not only a hairpin priming site for reverse transcriptase, a nicking enzyme recognition sequence, a copy number barcode, a sample barcode, and two PCR adapters, but also two separated halves of a restriction site. The first half of the restriction site occurs on the 3′ end of the copy number barcode, and the second half occurs on the 5′ end of the hairpin primer (i.e., the 5′ end of the two PCR adapters). The single stranded 5′ end of any excess hairpin primer that is not ligated to an RNA molecule will be linearly amplified by the combination of reverse transcriptase and a nicking enzyme. In addition, the resultant amplicons will be circularized in the same reaction that circularizes the cDNA library. However, for amplicons that do not include cDNA, the circularization reaction will join the two halves of the restriction site, allowing those amplicons to be selectively converted to linear DNA using a restriction enzyme. The restriction digest product will not be exponentially amplified by PCR. Furthermore, this linearized excess DNA can be eliminated by digestion with an exonuclease (e.g., Exonuclease I, which will digest linearized DNA but not circularized DNA).

Primers for specific reverse transcription of an organism's RNA can be obtained from the organism's genomic DNA. Genomic DNA can be enzymatically fragmented and ligated to PCR adapters that include restriction sites for downstream isolation of primer-sized DNA fragments. The adapter-ligated fragments can be amplified by PCR using PCR primers that include copy-number barcodes. In certain aspects, this process results in a library of barcoded genomic fragments which can then be cut into smaller fragments by one or more restriction enzymes. The smaller fragments are referred to as a genomic primer pool which can serve as a set of specific primers for capturing and reverse transcribing the RNA of the organism from which the pool was isolated. Furthermore, the fragments comprising the genomic primer pool include not only a specific, genomic primer sequence, but also a copy number barcode.

Although random oligonucleotide sequences can serve as copy number barcodes, it may be desirable in some cases to place certain constraints on the set of sequences used for barcoding. For example, certain homopolymeric sequences and sequences with very high G+C or A+T content may be difficult to amplify and sequence. In addition, barcode sequences that include other specific sequences or the complements of specific sequences to be used in library preparation (e.g., PCR adapters or sample barcodes) may be undesirable in certain circumstances. For a given organism or sample type, design principles may be applied to generate an optimized set of copy number barcodes. An optimized set of barcodes would contain a minimal number of members capable of hybridizing efficiently to genetic material in the target organism or to adapters, amplification primers, capture primers, enzyme recognition sequences, or sequencing primers used in sample preparation, quality control steps, or sequencing. In addition, an optimized barcode set would not include barcode sequences containing large homopolymeric tracts (e.g. >5 identical bases in a row), G-quadruplexes, or highly GC- or AT-rich sequences (e.g. GC-fraction should lie between 30-70%). Because such an optimized barcode set cannot be generated by random DNA synthesis, methods of highly parallel, sequence-specific DNA synthesis can be used. Methods for highly parallel, maskless array synthesis, such as those commercialized by Nimblegen, are known in the art.

In addition, computer based methods can be used to design an optimized barcode set in silico using software and design parameters as described herein. The optimized barcode set may then be synthesized using methods described herein and/or known to those of skill in the art.
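A minimal Python sketch of such an in silico filter is given below. It applies three of the design parameters named above: no run of more than 5 identical bases, a GC fraction between 30% and 70%, and no occurrence of a listed library-preparation sequence or its reverse complement. The thresholds are taken from the examples in the text; the function names and the `forbidden` argument are illustrative assumptions. G-quadruplex detection and hybridization to the target genome are not modeled.

```python
from itertools import groupby

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


def passes_filters(barcode, forbidden=(), max_run=5, gc_min=0.30, gc_max=0.70):
    """Apply the homopolymer, GC-content and forbidden-sequence filters."""
    if longest_homopolymer(barcode) > max_run:
        return False
    if not gc_min <= gc_fraction(barcode) <= gc_max:
        return False
    for motif in forbidden:
        if motif in barcode or reverse_complement(motif) in barcode:
            return False
    return True


def optimized_set(candidates, **kwargs):
    """Keep only the candidate barcodes that pass every filter."""
    return [bc for bc in candidates if passes_filters(bc, **kwargs)]
```

A filtered list produced this way could then be submitted for sequence-specific synthesis rather than generated by random synthesis.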

As shown in FIG. 1, cDNA with the unique barcode sequence is prepared for amplification using methods known to those of skill in the art. Such methods include, for example, circularization as described in WO/2010/094040, single-stranded adapter ligation as described in WO/2010/094040, hybridization of random primers as described in U.S. Pat. No. 6,124,120, or double-stranded adapter ligation as described in U.S. Pat. No. 7,741,463, each of which is hereby incorporated by reference in its entirety.

As shown in FIG. 1, cDNA with the unique barcode sequences is amplified using methods known to those of skill in the art. In certain aspects, amplification is achieved using PCR. The term “polymerase chain reaction” (“PCR”) of Mullis (U.S. Pat. Nos. 4,683,195, 4,683,202, and 4,965,188) refers to a method for increasing the concentration of a segment of a target sequence in a mixture of nucleic acid sequences without cloning or purification. This process for amplifying the target sequence consists of introducing a large excess of two oligonucleotide primers to the nucleic acid sequence mixture containing the desired target sequence, followed by a precise sequence of thermal cycling in the presence of a polymerase (e.g., DNA polymerase). The two primers are complementary to their respective strands of the double stranded target sequence. To effect amplification, the mixture is denatured and the primers then annealed to their complementary sequences within the target molecule. Following annealing, the primers are extended with a polymerase so as to form a new pair of complementary strands. The steps of denaturation, primer annealing, and polymerase extension can be repeated many times (i.e., denaturation, annealing and extension constitute one “cycle;” there can be numerous “cycles”) to obtain a high concentration of an amplified segment of the desired target sequence. The length of the amplified segment of the desired target sequence is determined by the relative positions of the primers with respect to each other, and therefore, this length is a controllable parameter. By virtue of the repeating aspect of the process, the method is referred to as the “polymerase chain reaction” (hereinafter “PCR”). Because the desired amplified segments of the target sequence become the predominant sequences (in terms of concentration) in the mixture, they are said to be “PCR amplified.”
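The effect of repeated cycles can be illustrated with a short Python sketch of idealized amplification, in which each cycle multiplies the number of target copies by (1 + efficiency). The per-cycle efficiency parameter is an assumption of this sketch and is not a value given herein; an efficiency of 1.0 corresponds to perfect doubling, and real reactions plateau as reagents are consumed.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a number of PCR cycles.

    Each cycle multiplies the copy number by (1 + efficiency), where
    efficiency is between 0 (no amplification) and 1 (perfect doubling).
    Plateau effects are deliberately ignored.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# One starting molecule, 30 ideal cycles.
example_ideal = pcr_copies(1, 30)
```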

With PCR, it is possible to amplify a single copy of a specific target sequence in genomic DNA to a level detectable by several different methodologies (e.g., hybridization with a labeled probe; incorporation of biotinylated primers followed by avidin-enzyme conjugate detection; incorporation of ³²P-labeled deoxynucleotide triphosphates, such as dCTP or dATP, into the amplified segment). In addition to genomic DNA, any oligonucleotide or polynucleotide sequence can be amplified with the appropriate set of primer molecules. In particular, the amplified segments created by the PCR process itself are, themselves, efficient templates for subsequent PCR amplifications. Methods and kits for performing PCR are well known in the art. PCR is a reaction in which replicate copies are made of a target polynucleotide using a pair of primers or a set of primers consisting of an upstream and a downstream primer, and a catalyst of polymerization, such as a DNA polymerase, and typically a thermally-stable polymerase enzyme. Methods for PCR are well known in the art, and taught, for example in MacPherson et al. (1991) PCR 1: A Practical Approach (IRL Press at Oxford University Press). All processes of producing replicate copies of a polynucleotide, such as PCR or gene cloning, are collectively referred to herein as replication. A primer can also be used as a probe in hybridization reactions, such as Southern or Northern blot analyses.

The expression “amplification” or “amplifying” refers to a process by which extra or multiple copies of a particular polynucleotide are formed. Amplification includes methods such as PCR, ligation amplification (or ligase chain reaction, LCR) and other amplification methods. These methods are known and widely practiced in the art. See, e.g., U.S. Pat. Nos. 4,683,195 and 4,683,202 and Innis et al., “PCR protocols: a guide to method and applications” Academic Press, Incorporated (1990) (for PCR); and Wu et al. (1989) Genomics 4:560-569 (for LCR). In general, the PCR procedure describes a method of gene amplification which is comprised of (i) sequence-specific hybridization of primers to specific genes within a DNA sample (or library), (ii) subsequent amplification involving multiple rounds of annealing, elongation, and denaturation using a DNA polymerase, and (iii) screening the PCR products for a band of the correct size. The primers used are oligonucleotides of sufficient length and appropriate sequence to provide initiation of polymerization, i.e. each primer is specifically designed to be complementary to each strand of the genomic locus to be amplified.

Reagents and hardware for conducting amplification reactions are commercially available. Primers useful to amplify sequences from a particular gene region are preferably complementary to, and hybridize specifically to, sequences in the target region or in its flanking regions and can be prepared using the polynucleotide sequences provided herein. Nucleic acid sequences generated by amplification can be sequenced directly.

When hybridization occurs in an antiparallel configuration between two single-stranded polynucleotides, the reaction is called “annealing” and those polynucleotides are described as “complementary”. A double-stranded polynucleotide can be complementary or homologous to another polynucleotide, if hybridization can occur between one of the strands of the first polynucleotide and the second. Complementarity or homology (the degree that one polynucleotide is complementary with another) is quantifiable in terms of the proportion of bases in opposing strands that are expected to form hydrogen bonding with each other, according to generally accepted base-pairing rules.

The terms “reverse-transcriptase PCR” and “RT-PCR” refer to a type of PCR where the starting material is mRNA. The starting mRNA is enzymatically converted to complementary DNA or “cDNA” using a reverse transcriptase enzyme. The cDNA is then used as a template for a PCR reaction.

The terms “PCR product,” “PCR fragment,” and “amplification product” refer to the resultant mixture of compounds after two or more cycles of the PCR steps of denaturation, annealing and extension are complete. These terms encompass the case where there has been amplification of one or more segments of one or more target sequences.

The term “amplification reagents” refers to those reagents (deoxyribonucleotide triphosphates, buffer, etc.) needed for amplification except for primers, nucleic acid template, and the amplification enzyme. Typically, amplification reagents along with other reaction components are placed and contained in a reaction vessel (test tube, microwell, etc.). Amplification methods include PCR methods known to those of skill in the art and also include rolling circle amplification (Blanco et al., J. Biol. Chem., 264, 8935-8940, 1989), hyperbranched rolling circle amplification (Lizardi et al., Nat. Genetics, 19, 225-232, 1998), and loop-mediated isothermal amplification (Notomi et al., Nuc. Acids Res., 28, e63, 2000), each of which is hereby incorporated by reference in its entirety.

The cDNA samples can be amplified by polymerase chain reaction (PCR) including emulsion PCR and single primer PCR in the methods described herein. For example, the cDNA samples can be amplified by single primer PCR. The CDS can comprise a 5′ amplification primer sequence (APS), which subsequently allows the first strand of cDNA to be amplified by PCR using a primer that is complementary to the 5′ APS. The TSO can also comprise a 5′ APS, which can be at least 70% identical, at least 80% identical, at least 90% identical, at least 95% identical, or 70%, 80%, 90% or 100% identical to the 5′ APS in the CDS. This means that the pooled cDNA samples can be amplified by PCR using a single primer (i.e. by single primer PCR), which exploits the PCR suppression effect to reduce the amplification of short contaminating amplicons and primer-dimers (Dai et al., J Biotechnol 128(3):435-43 (2007)). As the two ends of each amplicon are complementary, short amplicons will form stable hairpins, which are poor templates for PCR. This reduces the amount of truncated cDNA and improves the yield of longer cDNA molecules. The 5′ APS can be designed to facilitate downstream processing of the cDNA library. For example, if the cDNA library is to be analyzed by a particular sequencing method, e.g. Applied Biosystems' SOLiD sequencing technology, or Illumina's Genome Analyzer, the 5′ APS can be designed to be identical to the primers used in these sequencing methods. For example, the 5′ APS can be identical to the SOLiD P1 primer, and/or a SOLiD P2 sequence inserted in the CDS, so that the P1 and P2 sequences required for SOLiD sequencing are integral to the amplified library.

For emulsion PCR, an emulsion PCR reaction is created by vigorously shaking or stirring a “water in oil” mix to generate millions of micron-sized aqueous compartments. The DNA library is mixed in a limiting dilution either with the beads prior to emulsification or directly into the emulsion mix. The combination of compartment size and limiting dilution of beads and target molecules is used to generate compartments containing, on average, just one DNA molecule and bead (at the optimal dilution many compartments will have beads without any target). To facilitate amplification efficiency, both an upstream PCR primer (low concentration, matches primer sequence on bead) and a downstream PCR primer (high concentration) are included in the reaction mix. Depending on the size of the aqueous compartments generated during the emulsification step, up to 3×10⁹ individual PCR reactions per μl can be conducted simultaneously in the same tube. Essentially each little compartment in the emulsion forms a micro PCR reactor. The average size of a compartment in an emulsion ranges from sub-micron in diameter to over 100 microns, depending on the emulsification conditions.
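The limiting-dilution reasoning above can be illustrated with Poisson statistics. The following is a minimal sketch, assuming that template molecules partition randomly and independently into compartments (an assumption not stated in this disclosure); the function name and the mean occupancy values are hypothetical.

```python
import math


def occupancy_fractions(mean_per_compartment):
    """Return (empty, single, multiple) compartment fractions.

    Assumes the number of template molecules per compartment follows a
    Poisson distribution with the given mean (an illustrative model).
    """
    lam = mean_per_compartment
    p_empty = math.exp(-lam)
    p_single = lam * math.exp(-lam)
    p_multiple = 1.0 - p_empty - p_single
    return p_empty, p_single, p_multiple


if __name__ == "__main__":
    # Hypothetical mean occupancies, chosen only for illustration.
    for lam in (0.1, 0.5, 1.0):
        empty, single, multiple = occupancy_fractions(lam)
        print(f"mean={lam}: empty={empty:.3f} "
              f"single={single:.3f} multiple={multiple:.3f}")
```

Under this model, lowering the mean occupancy reduces the fraction of compartments holding more than one template, at the cost of more empty compartments, which is consistent with the observation that many compartments contain beads without any target.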

“Identity,” “homology” or “similarity” are used interchangeably and refer to the sequence similarity between two nucleic acid molecules. Identity can be determined by comparing a position in each sequence which can be aligned for purposes of comparison. When a position in the compared sequence is occupied by the same base or amino acid, then the molecules are homologous at that position. A degree of identity between sequences is a function of the number of matching or identical positions shared by the sequences. An unrelated or non-homologous sequence shares less than 40% identity, or alternatively less than 25% identity, with one of the sequences of the present invention.

A statement that a polynucleotide has a certain percentage (for example, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 98% or 99%) of “sequence identity” to another sequence means that, when aligned, that percentage of bases are the same in comparing the two sequences. This alignment and the percent sequence identity or homology can be determined using software programs known in the art, for example those described in Ausubel et al., Current Protocols in Molecular Biology, John Wiley & Sons, New York, N.Y., (1993). Preferably, default parameters are used for alignment. One alignment program is BLAST, using default parameters. In particular, programs are BLASTN and BLASTP, using the following default parameters: Genetic code=standard; filter=none; strand=both; cutoff=60; expect=10; Matrix=BLOSUM62; Descriptions=50 sequences; sort by=HIGH SCORE; Databases=non-redundant, GenBank+EMBL+DDBJ+PDB+GenBank CDS translations+SwissProtein+SPupdate+PIR. Details of these programs can be found at the National Center for Biotechnology Information.
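The percent identity calculation described above can be sketched for two sequences that have already been aligned. This is an illustrative example only: it assumes the alignment is supplied (for instance by an alignment program), that gaps are written as "-", and that columns containing a gap are excluded from the comparison, which is one of several conventions in use. The function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of compared alignment columns holding the same base.

    Both inputs must be the same length (already aligned). Columns in
    which either sequence has a gap are skipped (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical 8-base sequences differing at one position.
    print(percent_identity("ACGTACGT", "ACGTTCGT"))
```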

As shown in FIG. 1, the amplified cDNA is sequenced and analyzed using methods known to those of skill in the art. In certain exemplary embodiments, RNA expression profiles are determined using any sequencing methods known in the art. Determination of the sequence of a nucleic acid sequence of interest can be performed using a variety of sequencing methods known in the art including, but not limited to, sequencing by hybridization (SBH), sequencing by ligation (SBL) (Shendure et al. (2005) Science 309:1728), quantitative incremental fluorescent nucleotide addition sequencing (QIFNAS), stepwise ligation and cleavage, fluorescence resonance energy transfer (FRET), molecular beacons, TaqMan reporter probe digestion, pyrosequencing, fluorescent in situ sequencing (FISSEQ), FISSEQ beads (U.S. Pat. No. 7,425,431), wobble sequencing (PCT/US05/27695), multiplex sequencing (U.S. Ser. No. 12/027,039, filed Feb. 6, 2008; Porreca et al. (2007) Nat. Methods 4:931), polymerized colony (POLONY) sequencing (U.S. Pat. Nos. 6,432,360, 6,485,944 and 6,511,803, and PCT/US05/06425); nanogrid rolling circle sequencing (ROLONY) (U.S. Ser. No. 12/120,541, filed May 14, 2008), allele-specific oligo ligation assays (e.g., oligo ligation assay (OLA), single template molecule OLA using a ligated linear probe and a rolling circle amplification (RCA) readout, ligated padlock probes, and/or single template molecule OLA using a ligated circular padlock probe and a rolling circle amplification (RCA) readout) and the like. High-throughput sequencing methods, e.g., using platforms such as Roche 454, Illumina Solexa, AB-SOLiD, Helicos, Complete Genomics, Polonator platforms and the like, can also be utilized. A variety of light-based sequencing technologies are known in the art (Landegren et al. (1998) Genome Res. 8:769-76; Kwok (2000) Pharmacogenomics 1:95-100; and Shi (2001) Clin. Chem. 47:164-172).

The method of preparing a cDNA library described herein can further comprise processing the cDNA library to obtain a library suitable for sequencing. As used herein, a library is suitable for sequencing when the complexity, size, purity or the like of a cDNA library is suitable for the desired screening method. In particular, the cDNA library can be processed to make the sample suitable for any high-throughput screening methods, such as Applied Biosystems' SOLiD sequencing technology, or Illumina's Genome Analyzer. As such, the cDNA library can be processed by fragmenting the cDNA library (e.g. with DNase) to obtain a short-fragment 5′-end library. Adapters can be added to the cDNA, e.g. at one or both ends, to facilitate sequencing of the library. The cDNA library can be further amplified, e.g. by PCR, to obtain a sufficient quantity of cDNA for sequencing.

Embodiments of the invention provide a cDNA library produced by any of the methods described herein. This cDNA library can be sequenced to provide an analysis of gene expression in single cells or in a plurality of single cells.

Embodiments of the invention also provide a method for analyzing gene expression in a plurality of single cells, the method comprising the steps of preparing a cDNA library using the method described herein and sequencing the cDNA library. A “gene” refers to a polynucleotide containing at least one open reading frame (ORF) that is capable of encoding a particular polypeptide or protein after being transcribed and translated. Any of the polynucleotide sequences described herein can be used to identify larger fragments or full-length coding sequences of the gene with which they are associated. Methods of isolating larger fragment sequences are known to those of skill in the art.

As used herein, “expression” refers to the process by which polynucleotides are transcribed into mRNA and/or the process by which the transcribed mRNA is subsequently translated into peptides, polypeptides, or proteins. If the polynucleotide is derived from genomic DNA, expression can include splicing of the mRNA in a eukaryotic cell.

The cDNA library can be sequenced by any suitable screening method. In particular, the cDNA library can be sequenced using a high-throughput screening method, such as Applied Biosystems' SOLiD sequencing technology, or Illumina's Genome Analyzer. In one aspect of the invention, the cDNA library can be shotgun sequenced. The number of reads can be at least 10,000, at least 1 million, at least 10 million, at least 100 million, or at least 1000 million. In another aspect, the number of reads can be from 10,000 to 100,000, or alternatively from 100,000 to 1 million, or alternatively from 1 million to 10 million, or alternatively from 10 million to 100 million, or alternatively from 100 million to 1000 million. A “read” is a length of continuous nucleic acid sequence obtained by a sequencing reaction.

“Shotgun sequencing” refers to a method used to sequence very large amounts of DNA (such as the entire genome). In this method, the DNA to be sequenced is first shredded into smaller fragments which can be sequenced individually. The sequences of these fragments are then reassembled into their original order based on their overlapping sequences, thus yielding a complete sequence. “Shredding” of the DNA can be done using a number of different techniques including restriction enzyme digestion or mechanical shearing. Overlapping sequences are typically aligned by a suitably programmed computer. Methods and programs for shotgun sequencing a cDNA library are well known in the art.

The expression profiles described herein are useful in the field of predictive medicine in which diagnostic assays, prognostic assays, pharmacogenomics, and monitoring of clinical trials are used for prognostic (predictive) purposes to thereby treat an individual prophylactically. Accordingly, one aspect of the present invention relates to diagnostic assays for determining the expression profile of nucleic acid sequences (e.g., RNAs), in order to determine whether an individual is at risk of developing a disorder and/or disease. Such assays can be used for prognostic or predictive purposes to thereby prophylactically treat an individual prior to the onset of the disorder and/or disease. Accordingly, in certain exemplary embodiments, methods of diagnosing and/or prognosing one or more diseases and/or disorders using one or more of the expression profiling methods described herein are provided.

Yet another aspect of the invention pertains to monitoring the influence of agents (e.g., drugs or other compounds administered either to inhibit or to treat or prevent a disorder and/or disease) on the expression profile of nucleic acid sequences (e.g., RNAs) in clinical trials. Accordingly, in certain exemplary embodiments, methods of monitoring one or more diseases and/or disorders before, during and/or subsequent to treatment with one or more agents using one or more of the expression profiling methods described herein are provided.

Monitoring the influence of agents (e.g., drug compounds) on the level of expression of a marker of the invention can be applied not only in basic drug screening, but also in clinical trials. For example, the effectiveness of an agent to affect an expression profile can be monitored in clinical trials of subjects receiving treatment for a disease and/or disorder associated with the expression profile. In certain exemplary embodiments, the methods for monitoring the effectiveness of treatment of a subject with an agent (e.g., an agonist, antagonist, peptidomimetic, protein, peptide, nucleic acid, small molecule, or other drug candidate) comprise the steps of (i) obtaining a pre-administration sample from a subject prior to administration of the agent; (ii) detecting one or more expression profiles in the pre-administration sample; (iii) obtaining one or more post-administration samples from the subject; (iv) detecting one or more expression profiles in the post-administration samples; (v) comparing the one or more expression profiles in the pre-administration sample with the one or more expression profiles in the post-administration sample or samples; and (vi) altering the administration of the agent to the subject accordingly.

As used herein, the term “biological sample” is intended to include, but is not limited to, tissues, cells, biological fluids and isolates thereof, isolated from a subject, as well as tissues, cells and fluids present within a subject. Many expression detection methods use isolated RNA. Any RNA isolation technique that does not select against the isolation of mRNA can be utilized for the purification of RNA from biological samples (see, e.g., Ausubel et al., ed., Current Protocols in Molecular Biology, John Wiley & Sons, New York 1987-1999). Additionally, large numbers of tissue samples can readily be processed using techniques well known to those of skill in the art, such as, for example, the single-step RNA isolation process of Chomczynski (1989, U.S. Pat. No. 4,843,155).

The expression profiling methods described herein allow the quantitation of gene expression. Thus, not only tissue specificity, but also the level of expression of a variety of genes in the tissue is ascertainable. Thus, genes can be grouped on the basis of their tissue expression per se and level of expression in that tissue. This is useful, for example, in ascertaining the relationship of gene expression between or among tissues. Thus, one tissue can be perturbed and the effect on gene expression in a second tissue can be determined. In this context, the effect of one cell type on another cell type in response to a biological stimulus can be determined. Such a determination is useful, for example, to know the effect of cell-cell interaction at the level of gene expression. If an agent is administered therapeutically to treat one cell type but has an undesirable effect on another cell type, the invention provides an assay to determine the molecular basis of the undesirable effect and thus provides the opportunity to co-administer a counteracting agent or otherwise treat the undesired effect. Similarly, even within a single cell type, undesirable biological effects can be determined at the molecular level. Thus, the effects of an agent on expression of genes other than the target gene can be ascertained and counteracted.

In another embodiment, the time course of expression of one or more nucleic acid sequences (e.g., genes, mRNAs and the like) in an expression profile can be monitored. This can occur in various biological contexts, as disclosed herein, for example development of a disease and/or disorder, progression of a disease and/or disorder, and processes, such as cellular alterations associated with the disease and/or disorder.

The expression profiling methods described herein are also useful for ascertaining the effect of the expression of one or more nucleic acid sequences (e.g., genes, mRNAs and the like) on the expression of other nucleic acid sequences (e.g., genes, mRNAs and the like) in the same cell or in different cells. This provides, for example, for a selection of alternate molecular targets for therapeutic intervention if the ultimate or downstream target cannot be regulated.

The expression profiling methods described herein are also useful for ascertaining differential expression patterns of one or more nucleic acid sequences (e.g., genes, mRNAs and the like) in normal and abnormal cells. This provides a battery of nucleic acid sequences (e.g., genes, mRNAs and the like) that could serve as a molecular target for diagnosis or therapeutic intervention.

In certain exemplary embodiments, electronic apparatus readable media comprising one or more expression profiles described herein are provided. As used herein, “electronic apparatus readable media” refers to any suitable medium for storing, holding or containing data or information that can be read and accessed directly by an electronic apparatus. Such media can include, but are not limited to: magnetic storage media, such as floppy disks, hard disk storage medium, and magnetic tape; optical storage media such as compact disc; electronic storage media such as RAM, ROM, EPROM, EEPROM and the like; general hard disks and hybrids of these categories such as magnetic/optical storage media. The medium is adapted or configured for having recorded thereon one or more expression profiles described herein.

As used herein, the term “electronic apparatus” is intended to include any suitable computing or processing apparatus or other device configured or adapted for storing data or information. Examples of electronic apparatuses suitable for use with the present invention include stand-alone computing apparatus; networks, including a local area network (LAN), a wide area network (WAN), Internet, Intranet, and Extranet; electronic appliances such as personal digital assistants (PDAs), cellular phones, pagers and the like; and local and distributed processing systems.

As used herein, “recorded” refers to a process for storing or encoding information on the electronic apparatus readable medium. Those skilled in the art can readily adopt any of the presently known methods for recording information on known media to generate manufactures comprising one or more expression profiles described herein.

A variety of software programs and formats can be used to store the marker information of the present invention on the electronic apparatus readable medium. For example, the marker nucleic acid sequence can be represented in a word processing text file, formatted in commercially-available software such as WordPerfect and Microsoft Word, or represented in the form of an ASCII file, stored in a database application, such as DB2, Sybase, Oracle, or the like, as well as in other forms. Any number of data processor structuring formats (e.g., text file or database) may be employed in order to obtain or create a medium having recorded thereon one or more expression profiles described herein.

By providing one or more expression profiles described herein in readable form, one can routinely access the expression profile information for a variety of purposes. For example, one skilled in the art can use the one or more expression profiles described herein in readable form to compare a target expression profile with the one or more expression profiles stored within the data storage means. Search means are used to identify similarities and/or differences between two or more expression profiles.
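As one hedged illustration of comparing a target expression profile with a stored profile, the sketch below represents each profile as a mapping from a gene name to a molecule count and reports the genes whose counts differ by at least a chosen fold change. The representation, the pseudocount, the threshold, and the gene names are all assumptions made for illustration; they are not prescribed by this disclosure.

```python
def compare_profiles(target, stored, min_fold=2.0, pseudocount=1.0):
    """Return {gene: fold change} for genes differing by >= min_fold.

    Profiles are dicts of gene name -> count. A pseudocount is added to
    both counts so that genes absent from one profile can be compared.
    """
    differing = {}
    for gene in sorted(set(target) | set(stored)):
        t = target.get(gene, 0) + pseudocount
        s = stored.get(gene, 0) + pseudocount
        fold = t / s
        if fold >= min_fold or fold <= 1.0 / min_fold:
            differing[gene] = fold
    return differing


if __name__ == "__main__":
    # Hypothetical counts for three hypothetical genes.
    target_profile = {"geneA": 30, "geneB": 5, "geneC": 0}
    stored_profile = {"geneA": 31, "geneB": 23, "geneC": 0}
    print(compare_profiles(target_profile, stored_profile))
```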

FIG. 10(A) depicts the general concept of digital counting by random labeling of all target nucleic acid molecules in a sample with unique barcode sequences. Assume the original sample contains two cDNA sequences, one with three copies and another with two copies. An overwhelming number of unique barcode sequences are added to the sample in excess, and five are randomly ligated to the cDNA molecules. Ideally, each cDNA molecule in the sample receives a unique barcode sequence. After removing the excess barcodes, the barcoded cDNA molecules are amplified by PCR. Because of intrinsic noise and sequence-dependent bias, the barcoded cDNA molecules may be amplified unevenly. Consequently, after the amplicons are sequenced, it may appear that there are three copies of cDNA1 for every four copies of cDNA2 based on the relative number of reads for each sequence. However, the ratio in the original sample was 3:2, which is accurately reflected in the relative number of unique barcodes associated with each cDNA sequence. In the implementation of the method depicted in FIG. 10(A), it may be advantageous to randomly ligate both ends of each phosphorylated cDNA fragment to a barcoded phosphorylated Illumina Y-shaped adapter as shown in FIG. 10(B). Note that the single T and A overhangs present on the barcodes and cDNA, respectively, are to enhance ligation efficiency. After this step, the sample is amplified by PCR and prepared for sequencing using the standard Illumina library protocol. For each amplicon, both barcode sequences and both strands of the cDNA sequence are read using paired-end deep sequencing. According to one aspect, the paired-end strategy, i.e. attaching a barcode to each end of the nucleic acid in a sample, reduces the number of barcodes that must be designed and synthesized while allowing conventional paired-end library protocols and providing long-range sequence information that improves mapping accuracy.
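The difference between counting reads and counting unique barcodes in the FIG. 10(A) scenario can be sketched as follows. This is a minimal illustration, assuming each read has already been reduced to a (cDNA name, barcode) pair and that barcodes are read without sequencing errors; the barcode strings and read counts are hypothetical and chosen so that the reads show a 3:4 ratio while the original molecules are in a 3:2 ratio.

```python
from collections import Counter, defaultdict


def count_reads_and_molecules(reads):
    """Return (reads per cDNA, distinct barcodes per cDNA).

    `reads` is an iterable of (cdna_name, barcode) pairs. The number of
    distinct barcodes per cDNA estimates the original molecule count.
    """
    read_counts = Counter()
    barcodes = defaultdict(set)
    for cdna, barcode in reads:
        read_counts[cdna] += 1
        barcodes[cdna].add(barcode)
    molecule_counts = {cdna: len(seen) for cdna, seen in barcodes.items()}
    return dict(read_counts), molecule_counts


# Hypothetical reads after uneven PCR amplification of five molecules.
EXAMPLE_READS = (
    [("cDNA1", "AAGT")] * 150
    + [("cDNA1", "CCTA")] * 100
    + [("cDNA1", "GTTC")] * 50
    + [("cDNA2", "TGCA")] * 250
    + [("cDNA2", "ACGG")] * 150
)

if __name__ == "__main__":
    per_read, per_molecule = count_reads_and_molecules(EXAMPLE_READS)
    print("reads:", per_read)
    print("molecules:", per_molecule)
```

In practice, sequencing errors within a barcode would inflate the number of distinct barcodes, so a real analysis would likely need an error-tolerant grouping step that this sketch omits.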

It is to be understood that the embodiments of the present invention which have been described are merely illustrative of some of the applications of the principles of the present invention. Numerous modifications may be made by those skilled in the art based upon the teachings presented herein without departing from the true spirit and scope of the invention. The contents of all references, patents and published patent applications cited throughout this application are hereby incorporated by reference in their entirety for all purposes.

The following examples are set forth as being representative of the present invention. These examples are not to be construed as limiting the scope of the invention as these and other equivalent embodiments will be apparent in view of the present disclosure, figures and accompanying claims.

Example I Single Cell RNA Quantification with Copy Number Barcodes

Cell Selection by Laser Dissection Microscopy

Prior to cell selection, the intracellular RNA is stabilized by adding 0.5 mL of RNALater (Ambion) to every 0.1 mL of cell culture. A cell is then selected, cut from a culture dish, and dispensed in a tube using a laser dissection microscope (LMD-6500, Leica). The cells are plated onto a membrane-coated culture dish and observed using bright field microscopy with a 10× objective (Leica). A UV laser is then used to cut the membrane around an individually selected cell such that it falls into the cap of a PCR tube containing 20 μL of buffer. The captured cell is then thermally lysed.

Genomic DNA Removal

Genomic DNA can be removed from the sample by the addition of 1 μL of 0.1 U/μL DNase I (New England BioLabs) and incubation at 37° C. for 15 minutes. The reaction is then quenched by the addition of EDTA to a final concentration of 5 mM and heat inactivation at 75° C. for 10 minutes.

Reverse Transcription and Copy Number Barcoding Using RNA Ligation with Linear Amplification

To facilitate reverse transcription, a DNA oligonucleotide with the following sequence is ligated to each RNA molecule in the sample:

(SEQ ID NO: 1) 5′-P GCAGATCGGAAGAGGCTCGTTGAGGGAACAGGTTCAGAGTTCTACAGTCCGACGATCGGCNNNNNNNNNNNNNNNNNNNNTCCAACCGAAGAGCGGTGCACCGTCCGAGTCGACGTCTGGATCCTGTTCTTTCTCAGGATCCAGACGTCGACTCGGACGGTGCACCGCTCTTCG-3′

This oligonucleotide includes a hairpin primer for reverse transcription, two PCR adapters, the recognition sequence of a nicking enzyme, a sample barcode (for massively parallel sequencing), and a 20-base copy number barcode; the 5′ end is phosphorylated. The hairpin primer set is added to the single cell RNA sample to a final concentration of 1 μM such that nearly every RNA molecule in the sample will be conjugated to a different copy number barcode. The hairpin primers are ligated to the RNA molecules using T4 RNA Ligase 1 under the following conditions:

1×T4 RNA Ligase 1 Reaction Buffer (New England Biolabs, Ipswich, Mass.)

0.7 units/μL T4 RNA Ligase 1

10% (v/v) DMSO

The reaction mixture is incubated for 12 hours at 16° C. Lambda Exonuclease and Exonuclease I are then added to the sample at final concentrations of 1 unit/μL, and the sample is incubated at 37° C. for one hour to digest excess hairpin primer. The two exonucleases are then heat inactivated by incubation at 80° C. for 20 minutes.
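The expectation that nearly every RNA molecule receives a different 20-base copy number barcode can be checked with a simple calculation. The sketch below assumes that each molecule is labeled independently with a barcode drawn uniformly at random from all 4^L possible sequences of length L (an idealization; real barcode pools and ligation may be biased), and it uses a hypothetical count of one million RNA molecules per cell.

```python
import math


def fraction_uniquely_barcoded(num_molecules, barcode_length):
    """Expected fraction of molecules whose barcode no other molecule shares.

    Assumes barcodes are drawn uniformly and independently from all
    4**barcode_length sequences, so a given molecule is unique with
    probability (1 - 1/N)**(num_molecules - 1), where N = 4**barcode_length.
    """
    num_barcodes = 4 ** barcode_length
    # log1p keeps precision when 1/num_barcodes is extremely small.
    return math.exp((num_molecules - 1) * math.log1p(-1.0 / num_barcodes))


if __name__ == "__main__":
    # Hypothetical molecule count; 20-base barcode as in SEQ ID NO: 1.
    print(fraction_uniquely_barcoded(10**6, 20))
    # 6-base barcode, as in Template 1 of Example IV, with 7 molecules.
    print(fraction_uniquely_barcoded(7, 6))
```

Under these assumptions, a 20-base barcode leaves a large margin even for a million molecules, whereas shorter barcodes are suitable only when the number of molecules is small relative to the number of possible barcodes.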

The hairpin primed RNA templates are then linearly amplified by the combination of a reverse transcriptase with strand-displacement activity and a nicking enzyme. The nicking enzyme cuts at a recognition sequence adjacent to the hairpin primer whenever reverse transcriptase generates double-stranded DNA at the recognition site. This cut site allows reverse transcriptase to copy the RNA template repeatedly, removing the previously generated cDNA with its strand-displacement activity. To accomplish this, MonsterScript (Epicentre) and Nt.BspQI (New England Biolabs) are added to the sample at final concentrations of 0.5 units/μL and 2 units/μL, respectively, along with 0.2 mM dNTPs. The reaction mixture is then incubated for one hour at 50° C.

The linearly amplified cDNA is then circularized using CircLigase II (Epicentre Biotechnologies, Madison, Wis.). Manganous chloride, betaine, and CircLigase II are added to the sample at final concentrations of 2.5 mM, 1 M, and 5 units/μL, respectively. The new reaction mixture is then incubated for one hour at 60° C.

Sequencing Library Preparation and Massively Parallel Sequencing

Using the two adapters included in each cDNA molecule, the cDNA is amplified by PCR. This is accomplished using the manufacturer's instructions for library preparation and varies slightly for different massively parallel sequencing platforms (e.g., Roche/454 Life Sciences, SOLiD by Life Technologies, HiSeq 2000 or Genome Analyzer by Illumina). In this particular example, Illumina adapter sequences are included in the cDNA library such that the circularized templates can be amplified using the standard Illumina library preparation kit (e.g. TruSeq) and sequenced on a HiSeq 2000.

Example II DNA Quantification Using Copy Number Barcodes

Eight DNA template oligos (135 bases) were purchased from Integrated DNA Technologies along with two primers:

Template A: (SEQ ID NO: 2) 5′-ACAGGTTCAGAGTTCTACAGTCCGACGATCAGCTNNNNNNNNNNNNNNNNTGGTGGAGCTGGCGGGAGTTGAACCCGCGTCCGAAATTCCTACATCCTCGGTACTACATGGCCGTCGTATGCCGTCTTCTGCTTG-3′

Template B: (SEQ ID NO: 3) 5′-ACAGGTTCAGAGTTCTACAGTCCGACGATCAGCTNNNNNNNNNNNNNNNNTCGGGCCGGGGGTTGGGCCAGGCTCTGAGGTGTGGGGGATTCCCCCATGCCCCCCGCCGTGCCGTCGTATGCCGTCTTCTGCTTG-3′

Template C: (SEQ ID NO: 4) 5′-ACAGGTTCAGAGTTCTACAGTCCGACGATCAGCTNNNNNNNNNNNNNNNNTTATAAATACCGGCCCCGGCGGAAAACCAAGACGCTCATGAAGAAGGATAAGTACACGCTGCCGTCGTATGCCGTCTTCTGCTTG-3′

Template D: (SEQ ID NO: 5) 5′-ACAGGTTCAGAGTTCTACAGTCCGACGATCAGCTNNNNNNNNNNNNNNNNCGCCGCGGGGTGCACCGTCCGGACCCTGTTTTCAGGGTCCGGACGGTGCACCCCGCGGCGGCCGTCGTATGCCGTCTTCTGCTTG-3′

Template E: (SEQ ID NO: 6) 5′-ACAGGTTCAGAGTTCTACAGTCCGACGATCAGCTNNNNNNNNNNNNNNNNCAAGCAGAAGACGGCTCCGGGACCGTCCGGACCCTGTTTTCAGGGTCCGGACGGTCCCGGGCCGTCGTATGCCGTCTTCTGCTTG-3′

Template F: (SEQ ID NO: 7) 5′-ACAGGTTCAGAGTTCTACAGTCCGACGATCAGCTNNNNNNNNNNNNNNNNGTTGCAGAAGACGGCTCCGGGACCGTCCGGACCCTGTTTTCAGGGTCCGGACGGTCCCGGGCCGTCGTATGCCGTCTTCTGCTTG-3′

Template G: (SEQ ID NO: 8) 5′-ACAGGTTCAGAGTTCTACAGTCCGACGATCAGCTNNNNNNNNNNNNNNNNCGCCGCGGTGCACCTTTTGGTGCACCGCGGCGCCCGCGTCCGAAATTCCTACATCCTCGGGCCGTCGTATGCCGTCTTCTGCTTG-3′

Template H: (SEQ ID NO: 9) 5′-ACAGGTTCAGAGTTCTACAGTCCGACGATCAGCTNNNNNNNNNNNNNNNNGTGAGAGAGTGAGCGAGACAGAAAGAGAGAGAAGTGCACCAGCGAGCCGGGGCAGGAAGAGCCGTCGTATGCCGTCTTCTGCTTG-3′

Primer 1: (SEQ ID NO: 10) 5′-AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA-3′

Primer 2: (SEQ ID NO: 11) 5′-CAAGCAGAAGACGGCATACGACGGC-3′

“N” designates a random base. The 3′ and 5′ ends of each template are complementary to Primer 1 and the complement of Primer 2, respectively. Each of the eight DNA oligos includes 16 random bases which serve as a copy number barcode. The templates are diluted identically in two different tubes containing a PCR Master Mix such that the average copy number of each template is as follows:

Templates A-D: 1 copy per tube
Template E: 10 copies per tube
Template F: 100 copies per tube
Template G: 10,000 copies per tube
Template H: 1,000,000 copies per tube.

The PCR Master Mix consists of:

0.5 μM Primer 1

0.5 μM Primer 2

0.2 mM dNTPs (New England Biolabs, Ipswich, Mass.)

1× Phusion HF Buffer (New England Biolabs)

0.02 units/μL Phusion DNA Polymerase (New England Biolabs)

at a final volume of 50 μL.

The two PCR samples are then thermocycled as follows:

1) 98° C. for 30 s
2) 98° C. for 10 s
3) 60° C. for 30 s
4) 72° C. for 30 s
5) Repeat steps 2-4 19 times.
6) 72° C. for 10 minutes.

The two PCR samples are combined with 20 μL of ExoSAP-IT PCR Product Cleanup mixture (USB) and incubated for 15 minutes at 37° C. followed by heat inactivation for 15 minutes at 80° C. Each of the two samples is then sequenced in one lane of a HiSeq 2000 (Illumina) sequencer according to the manufacturer's instructions, resulting in approximately 10⁸ reads per sample. For each template sequence, the number of different copy number barcode sequences is counted to determine the copy number of the template sequence in the original sample.
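The counting step described above can be sketched as follows. This is an illustrative outline rather than the analysis actually used: it assumes each read contains the constant bases that immediately precede the 16 random bases in Templates A-H, identifies the template from the 10 bases that follow the barcode (taken from the template sequences listed above), and performs no correction for sequencing errors. The function name, constants, and parsing scheme are hypothetical.

```python
from collections import defaultdict

# Constant bases immediately 5' of the barcode in Templates A-H.
ANCHOR = "GACGATCAGCT"
BARCODE_LEN = 16
PREFIX_LEN = 10

# First 10 template-specific bases following the barcode in each template.
TEMPLATE_PREFIXES = {
    "TGGTGGAGCT": "A",
    "TCGGGCCGGG": "B",
    "TTATAAATAC": "C",
    "CGCCGCGGGG": "D",
    "CAAGCAGAAG": "E",
    "GTTGCAGAAG": "F",
    "CGCCGCGGTG": "G",
    "GTGAGAGAGT": "H",
}


def count_copy_numbers(reads):
    """Return {template name: number of distinct barcodes observed}."""
    barcodes = defaultdict(set)
    for read in reads:
        start = read.find(ANCHOR)
        if start < 0:
            continue  # anchor not found; read cannot be parsed
        b0 = start + len(ANCHOR)
        barcode = read[b0:b0 + BARCODE_LEN]
        prefix = read[b0 + BARCODE_LEN:b0 + BARCODE_LEN + PREFIX_LEN]
        name = TEMPLATE_PREFIXES.get(prefix)
        if name is None or len(barcode) < BARCODE_LEN or "N" in barcode:
            continue  # unknown template or incomplete/ambiguous barcode
        barcodes[name].add(barcode)
    return {name: len(seen) for name, seen in barcodes.items()}
```

Because reads that share a barcode are collapsed into a single count, the result depends on the number of distinct starting molecules rather than on how many times each molecule was amplified.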

Example III Generation of a Genomic Primer Pool with Copy Number Barcodes

Genomic DNA is fragmented using Fragmentase (New England Biolabs) by combining 5 μg of purified genomic DNA (FIG. 4B), 10 μL of NEBNext dsDNA Fragmentase, 1× NEBNext dsDNA Fragmentation Reaction Buffer (New England Biolabs), and 0.1 mg/mL BSA in a final volume of 100 μL. The reaction mixture is then incubated at 37° C. for 30 minutes, quenched by the addition of 100 mM EDTA, purified on a DNA purification column (Zymo Research), and eluted in a final volume of 35 μL. FIG. 4B depicts an electrophoretic gel in which samples taken from different time points in the Fragmentase reaction were run in different lanes of the gel. The first and second lanes contain standard DNA ladder sequences, and the third, fourth, and fifth lanes contain samples taken 20 minutes, 30 minutes, and 40 minutes from the initiation of DNA fragmentation, respectively. The gel shows that the average DNA length in the sample decreases as the reaction progresses.

Fragment end repair is accomplished using the NEBNext End Repair Enzyme Mix from New England Biolabs. The entire 35 μL of purified DNA fragments from the previous step are combined with 5 μL of NEBNext End Repair Enzyme Mix, 1× NEBNext End Repair Reaction Buffer, and E. coli Ligase (New England Biolabs) at 0.1 units/μL in a final volume of 100 μL. The reaction mixture is then incubated for 30 minutes at 20° C. followed by purification with a DNA purification column (Zymo Research) and elution into a final volume of 42 μL. Deoxyadenosine tails (dA-tails) are then added to the purified, end-repaired DNA by combining all 42 μL of purified DNA with 1× NEBNext dA-Tailing Reaction Buffer and 3 μL Klenow fragment exo- (New England Biolabs) in a final volume of 50 μL. This reaction mixture is then incubated for 30 minutes at 37° C. followed by purification with a DNA purification column (Zymo Research) and elution into a final volume of 8 μL.

FIG. 4A shows a schematic of a DNA fragment after adapter ligation. Two PCR adapter oligonucleotides are ligated to the 5′-end and 3′-end of each DNA fragment in the sample. As shown in FIG. 4A, the two PCR adapters are not only complementary to a set of PCR primers, they also contain a recognition sequence for MmeI, a restriction enzyme whose cut site is ~20 bases away from the recognition sequence. Oligonucleotides that include PCR adapters at their 5′ end and a recognition site for the restriction enzyme MmeI at their 3′ end are ligated onto the purified, dA-tailed genomic DNA fragments using Quick Ligase (New England Biolabs). This is accomplished by combining all 8 μL of purified DNA from the previous step with the oligonucleotides at a final concentration of 2 μM, 5 μL of Quick T4 DNA Ligase, and 1× Quick Ligation Reaction Buffer in a final volume of 50 μL. This reaction mixture is then incubated for 15 minutes at 20° C. and loaded onto a 1.5% agarose gel which is run at 120 V for 50 minutes. The gel is stained with SybrSafe (Invitrogen), and the band that appears on the gel between 300 and 400 bp is then cut and the DNA is isolated from the gel using a gel purification kit (Qiagen) according to the manufacturer's instructions.

Following adapter ligation, the genomic fragments are amplified with PCR. The PCR primers used in this amplification step include copy number barcodes on their 5′ ends so that the final amplicons are randomly or pseudo-randomly barcoded. In addition, the PCR primers can be biotinylated to facilitate isolation of the genomic primer pool in the final step of this protocol. For PCR, the purified, adapter-ligated genomic fragments are diluted to a final concentration of 100 fM and combined with 0.5 μL of Phusion DNA polymerase (New England Biolabs), 1× Phusion High Fidelity Buffer (New England Biolabs), 0.2 mM dNTPs, and 2 μM primers in a final volume of 50 μL. The PCR mixture is then thermocycled as follows:

1) 98° C. for 30 s
2) 98° C. for 5 s
3) 68° C. for 15 s
4) 72° C. for 10 s
5) Repeat steps 2-4 24 times.
6) 72° C. for 10 minutes.

The resulting PCR amplicons are then purified with a DNA purification column (Zymo Research).

FIG. 4C depicts an electrophoretic gel with lanes corresponding to samples in which different restriction enzymes were used to generate genomic primers. The first, second, and third lanes contain digestion products for BpuEI, BpmI, and MmeI, respectively. In each case, the lowest band corresponds to the digestion product which can be used as a pool of genomic primers. In order to isolate the genomic primer pool, the PCR amplicons are cut using MmeI, a restriction enzyme that cuts the double-stranded amplicons 20 bases from its recognition site, as shown in FIG. 4C. The recognition site for MmeI was included in the oligonucleotides that were ligated to the genomic fragments such that digestion by MmeI will result in fragments that include a copy number barcode, a PCR adapter, a single A base, and 19 bases of the genomic fragment. The restriction digest mixture includes 9 μL of purified PCR product, 1× NEB Buffer 4, 0.2 units/μL MmeI (New England Biolabs), and 50 μM S-adenosylmethionine in a final volume of 20 μL. The reaction mixture is incubated at 37° C. for one hour followed by purification of single stranded fragments using streptavidin-coated magnetic beads according to the manufacturer's instructions for Roche/454 library preparation.
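The geometry of the MmeI digestion can be illustrated with a short in silico sketch. It assumes the commonly documented MmeI recognition sequence 5′-TCCRAC-3′ (R = A or G) with top-strand cleavage 20 bases 3′ of the site, considers only the top strand and only a site in the forward orientation, and uses a hypothetical adapter and genomic sequence; none of these sequences are taken from this disclosure.

```python
import re

# MmeI recognition sequence on the top strand (R = A or G).
MMEI_SITE = re.compile("TCC[AG]AC")
# Top-strand cleavage position: 20 bases 3' of the recognition site.
TOP_STRAND_OFFSET = 20


def mmei_top_strand_cut(seq):
    """Split `seq` at the first forward MmeI cut site on the top strand.

    Returns (upstream, downstream), or None if there is no site or the
    sequence is too short to reach the cut position.
    """
    match = MMEI_SITE.search(seq.upper())
    if match is None:
        return None
    cut = match.end() + TOP_STRAND_OFFSET
    if cut > len(seq):
        return None
    return seq[:cut], seq[cut:]


if __name__ == "__main__":
    # Hypothetical adapter ending in an MmeI site, then the single A base
    # from dA-tailing, then hypothetical genomic sequence.
    adapter = "GGCAT" + "TCCGAC"
    genomic = "CGTCGTCGTCGTCGTCGTC" + "TTTTT"
    print(mmei_top_strand_cut(adapter + "A" + genomic))
```

With the recognition site placed at the 3′ end of the adapter, the retained fragment ends with the single A base and 19 genomic bases, matching the fragment composition described above.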

Example IV DNA Counting with a Copy Number Barcode Using Sanger Sequencing

A series of DNA templates that either contained or did not contain a copy number barcode were amplified in a single tube by PCR with a common set of primers. This was accomplished by combining 1× Taq Master Mix (New England Biolabs), 0.5 μM of each PCR primer, and 10 fM of each DNA template in a final volume of 20 μL. The reaction mixture was thermocycled under the following conditions:

-   1) 94 C for 10 min
-   2) 94 C for 30 s
-   3) 58 C for 30 s
-   4) 72 C for 4 s
-   5) Repeat steps 2-4 29 times.
-   6) 72 C for 10 minutes.

The two DNA templates and PCR primers had the following sequences:

Template 1 (with copy number barcode): (SEQ ID NO: 12) 5′-CCCTACACGACGCTCTTCCGATCTNNNNNNAATGATACGGCGACCACCGAGATCTACACT-3′

Template 2 (without copy number barcode): (SEQ ID NO: 13) 5′-CCCTACACGACGCTCTTCCGATCTAGCTCAAATGATACGGCGACCACCGAGATCTACACT-3′

PCR Primer 1: (SEQ ID NO: 14) 5′-AGTGTAGATCTCGGTGGTCGCCG-3′

PCR Primer 2: (SEQ ID NO: 15) 5′-CCCTACACGACGCTCTTCCGATC-3′

The amplicons were then cloned using a TA cloning kit (Invitrogen) and sent out for Sanger sequencing (Genewiz). As shown in FIG. 5, variant sequences at the copy number barcode site were found only when the DNA template containing the copy number barcode was used, showing that there are at least seven DNA copies in the sample.

Example V Copy Number Barcode Incorporation and RNA Counting

Reverse transcription was performed using DNA primers that either contained or did not contain a copy number barcode, as in the two cases shown in FIG. 6. In this example, there are two copies of RNA in the original solution. As a first step, reverse transcription is performed using a DNA primer without (top) and with (bottom) a CNB (6 bases in FIG. 6). The CNB is a random nucleotide sequence which can be sequenced. The cDNA generated by reverse transcription is amplified and sequenced. By counting the distinct CNBs, one can learn how many copies of RNA were in the original solution. This technique can be applied to a mixture of multiple RNA sequences in a single tube.
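The counting step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used in this example: the flanking sequences are taken from either side of the six-base N region of the barcoded reverse transcription primer (SEQ ID NO: 26), while the function name `count_copy_number_barcodes` and the toy reads are hypothetical.

```python
import re

# Flanks copied from either side of the NNNNNN region of SEQ ID NO: 26.
# Parsing reads with exactly these flanks is an assumption of this sketch.
LEFT_FLANK = "CTCTTCCGATCT"
RIGHT_FLANK = "CGGTCGTCAGCC"
CNB_PATTERN = re.compile(LEFT_FLANK + "([ACGT]{6})" + RIGHT_FLANK)


def count_copy_number_barcodes(reads):
    """Return the set of distinct 6-base copy number barcodes found in reads."""
    barcodes = set()
    for read in reads:
        match = CNB_PATTERN.search(read)
        if match:
            barcodes.add(match.group(1))
    return barcodes


# Hypothetical reads: three clones, two of which carry the same barcode and
# therefore derive from the same original molecule.
reads = [
    "ACGCTCTTCCGATCTAAGTCCCGGTCGTCAGCCAACG",
    "ACGCTCTTCCGATCTAAGTCCCGGTCGTCAGCCAACG",
    "ACGCTCTTCCGATCTGTTACACGGTCGTCAGCCAACG",
]
distinct = count_copy_number_barcodes(reads)
print(len(distinct))  # lower bound on the number of original copies
```

Because only a subset of clones is sequenced, the number of distinct barcodes gives a lower bound ("at least") on the copy number rather than an exact count.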

RNA was first obtained using an in vitro transcription kit (Ambion) to transcribe RNA from the φX174 viral genome with the following sequence:

(SEQ ID NO: 25) 5′-AATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTG-3′

The reverse transcription primer that includes a copy number barcode has the following sequence:

(SEQ ID NO: 26) 5′-GAATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNCGGTCGTCAGCCAACGTGAGAGTG-3′

This primer is phosphorylated on its 5′ end. The sequence of the reverse transcription primer that does not include a copy number barcode is as follows:

(SEQ ID NO: 27) 5′-GAATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTAGCTGTCGGTCGTCAGCCAACGTGAGAGTG-3′

The primers at a final concentration of 2.2 μM were first annealed to the RNA sample at a final concentration of 2.1 μM in a total volume of 4.5 μL by heating the sample to 90 C and cooling gradually to 25 C, pausing every 5 C for one minute. The RNA sample was then reverse transcribed by combining 1× First-Strand Buffer (Invitrogen), 5 mM DTT, 0.5 mM dNTPs, and SuperScript III (Invitrogen, Carlsbad, Calif.) at a final concentration of 20 units/μL. The reaction mixture was then incubated at 37 C for 30 minutes. The sample was then treated with Exonuclease I (New England Biolabs) at a final concentration of 2 units/μL and incubated for 60 minutes at 37 C. The exonuclease was then inactivated by adding EDTA to a final concentration of 7.7 mM and incubating at 80 C for 25 minutes. The RNA was then digested by combining RNase H (Invitrogen) at 0.06 units/μL, RNase A (Qiagen) at 0.28 mg/mL, and 5.6 mM magnesium chloride. The reaction mixture was incubated for 30 minutes at 37 C followed by purification with a PuriZymo column (Zymo Research Corporation, Irvine, Calif.) and elution into a final volume of 6 μL. The final concentration of cDNA was found to be 1.6 μM.

The purified cDNA was circularized by combining 0.4 μM cDNA with 2.5 mM manganous chloride, 1 M betaine, 1× CIRCLIGASE™ II Reaction Buffer (Epicentre), and 0.1 μM CIRCLIGASE™ II (Epicentre). The reaction mixture was incubated for 1.5 hours at 60 C followed by heat inactivation for 10 minutes at 80 C. Any remaining linear DNA was then removed by the addition of Exonuclease I at a final concentration of 0.5 units/μL and incubation at 37 C for 30 minutes followed by heat inactivation for 20 minutes at 80 C.

The cDNAs were then amplified by PCR using two primers:

(SEQ ID NO: 28) 5′-AGTGTAGATCTCGGTGGTCGCCG-3′

(SEQ ID NO: 29) 5′-CCCTACACGACGCTCTTCCGATC-3′

The amplification was carried out by combining the cDNA with primers at 0.5 μM, dNTPs at 0.2 mM, 1× Phusion High Fidelity Buffer (New England Biolabs), and 0.5 μL of Phusion DNA polymerase (New England Biolabs) in a final volume of 20 μL. The reaction mixture was thermocycled under the following conditions:

-   1) 98 C for 30 s
-   2) 98 C for 10 s
-   3) 58 C for 15 s
-   4) 72 C for 35 s
-   5) Repeat steps 2-4 29 times.
-   6) 72 C for 8 minutes.

The PCR product was gel purified on a 1.5% agarose gel run at 120 V for 30 minutes followed by purification with a DNA purification column (Zymo Research Corporation). The purified PCR product was then cloned using a TA Cloning Kit (Invitrogen) and sent out for Sanger sequencing (Genewiz, South Plainfield, N.J.). Just as in the case of DNA counting, variant sequences at the copy number barcode site were found only when the DNA primer containing the copy number barcode was used, showing that there are at least eight different RNA copies in the original sample. See FIG. 7.

Example VI Investigation of PCR Bias Using qPCR

The amplification bias associated with three DNA oligos was investigated using quantitative PCR (qPCR). The sequences of the three DNA templates were as follows:

Template A (SEQ ID NO: 2) ACAGGTTCAGAGTTCTACAGTCCGACGATCAGCTNNNNNNNNNNNNNNNNTGGTGGAGCTGGCGGGAGTTGAACCCGCGTCCGAAATTCCTACATCCTCGGTACTACATGGCCGTCGTATGCCGTCTTCTGCTTG

Template I (SEQ ID NO: 30) ACAGGTTCAGAGTTCTACAGTCCGACGATCNNNNNNNNNNNNNNNNCCGGGACCGTCCGAGCTTCGGATACCTAGACAAGCAGAAGACGGCTGACCCTGTTTTCAGGGTCGCCGTCGTATGCCGTCTTCTGCTTG

Template J (SEQ ID NO: 31) ACAGGTTCAGAGTTCTACAGTCCGACGATCNNNNNNNNNNNNNNNNCGTAGTACCGTCCGGACCCTGTTTTCAGGGTCCGGACGGGCACAGATTAGCACCCTATCGACGAGCCGTCGTATGCCGTCTTCTGCTTG

The amplification efficiency of the three DNA templates was measured using qPCR. Primer 1 (SEQ ID NO:10) and Primer 2 (SEQ ID NO:11) were added to 1× Fast SYBR Green Master Mix (Applied Biosystems) to a final concentration of 0.5 μM. One of the three DNA template samples was then added to the PCR reaction mixture to one of four final concentrations (0.1 pM, 1 pM, 10 pM, or 100 pM). Amplification efficiencies were obtained for each reaction mixture using a Fast Real-Time PCR 7500 System (Applied Biosystems) under the following thermocycling conditions:

-   1) 98 C for 20 s
-   2) 98 C for 3 s
-   3) 60 C for 30 s
-   4) Repeat steps 2-3 59 times

The resultant amplification efficiencies are shown in FIG. 9.

Example VII Generation and Optimization of Barcodes

A set of 2,358 random 20-base optimized barcodes having a distance of nine was prepared using a computer such that even if a barcode accumulated nine mutations, it would not take the sequence of any other generated barcode sequences. Suitable software programs are commercially available that can be used to generate optimized barcode sequences in silico such as Python version 3.2.1 (world wide web python.org) and gcc compiler for C version 4.1.2 (world wide web gcc.gnu.org) which were used for the generation and filtering of the barcode candidates described below.

A first barcode candidate containing 20 nucleotides was randomly generated in silico. Then, a second 20 nucleotide potential barcode candidate was randomly generated in silico and the number of sequence mismatches required to regenerate the first barcode candidate from the potential candidate (threshold value) was determined. If the threshold value is greater than 9 (i.e., the distance), the potential candidate is kept and added to the set of final barcode candidates; if not, the potential candidate is discarded and new potential candidates are generated randomly until the threshold value between the new potential candidate and the first barcode candidate is greater than 9; this new potential barcode candidate is then kept and added to the set of final barcode candidates. Subsequent sequences are then generated randomly and compared to all previously kept barcode candidates and discarded if allowing 9 or fewer sequence mismatches will allow the new potential barcode candidate to exactly take on the sequence of any previously kept barcode candidate. This process was repeated until we achieved a set of 2,358 final barcode candidates. It is to be understood that one can use this procedure on barcode candidates of any length or multiple lengths and with any threshold value criterion. Running this procedure multiple times should not result in the exact same set of barcode candidate sequences between runs.
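The generate-and-compare loop described above can be sketched as follows. This is a minimal illustration and not the code used to produce the 2,358 candidates; the function names, the `seed` argument, and the `max_tries` safety limit are hypothetical, and only substitutions (Hamming distance) are considered at this stage.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def generate_barcodes(n_barcodes, length=20, distance=9, seed=None,
                      max_tries=1_000_000):
    """Greedy random generation: keep a candidate only if it differs from
    every previously kept barcode at more than `distance` positions."""
    rng = random.Random(seed)
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_barcodes:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if all(hamming(candidate, other) > distance for other in kept):
            kept.append(candidate)
    return kept


# Small demonstration set; a different seed gives a different set.
barcodes = generate_barcodes(50, seed=1)
```

Each new candidate is compared against every kept barcode, so the cost grows quadratically with the size of the set; for a few thousand 20-base barcodes this remains practical.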

One is also able to systematically generate barcode candidates by using the Hadamard Code as described in Bose R C, Shrikhande S S (1959) A note on a result in the theory of code construction. Information and Control 2:183-194 hereby incorporated by reference herein in its entirety, which allows generation of a set number of barcode candidates each a set, specific threshold value (number of allowed sequence mismatches required for a barcode candidate to exactly match the sequence of another barcode candidate) apart from all other barcode candidates. However, this technique is limited in that it requires that the length of the barcode be a power of 2.
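The fixed-distance property of such codes can be illustrated with the rows of a Sylvester-type Hadamard matrix, which exist only for lengths that are powers of 2. The sketch below is an assumption-laden illustration: it produces binary codewords, and how binary symbols would be mapped onto nucleotides is not specified in the text, so no mapping is attempted. The function names are hypothetical.

```python
def sylvester_hadamard(n):
    """Return the n x n Sylvester Hadamard matrix with entries +1/-1.
    n must be a power of 2."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    h = [[1]]
    while len(h) < n:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def hadamard_codewords(n):
    """Map each matrix row to a binary string (+1 -> '0', -1 -> '1')."""
    return ["".join("0" if x == 1 else "1" for x in row)
            for row in sylvester_hadamard(n)]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


words = hadamard_codewords(16)
# Any two distinct rows differ at exactly n/2 positions.
distances = {hamming(a, b) for i, a in enumerate(words) for b in words[i + 1:]}
```

For length 16 every pair of codewords is exactly 8 mismatches apart, which is the "set, specific threshold value" behaviour described above.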

Barcode candidates containing homopolymers longer than four bases or GC-content less than 40% or greater than 60% were discarded. Barcode candidates were also discarded if they exceeded a certain degree of complementarity or sequence identity (total matches and maximum consecutive matches) with (1) the Illumina paired-end sequencing primers described in Bentley, Nature 456:53-59 (2008) hereby incorporated by reference in its entirety, (2) the Illumina PCR primers PE 1.0 and 2.0, (3) the 3′ end of the Illumina PCR primers PE 1.0 and 2.0, (4) the whole E. coli genome [K-12 MG1655 strain (U00096.2)], and (5) all other generated barcode candidates. Any barcode candidate for which an indel mutation would place it within five point mutations of another barcode candidate was also discarded. The final population consisted of 150 barcodes, the sequences of which were identified, of which 145 were randomly chosen and used. It is to be understood that the present disclosure is not limited to specific sequences of barcodes within a set as the design of each set is likely to vary depending on the criteria used to identify barcodes and the methods to create them in silico.

Specifically, each of the 2,358 barcodes was analyzed for sequence characteristics that would contribute to either amplification or sequencing errors as follows.

Initial Filtering.

All barcodes with less than 40% or greater than 60% GC content or containing homopolymers greater than length four were deleted. All barcode sequences were compared to the PE 1.0 and 2.0 Illumina PCR primer sequences and discarded if there were more than 10 total base matches or more than five consecutive base matches in any possible alignment with either primer sequence (sense and antisense). Each barcode was compared to the final four, five, and six bases closest to the 3′ end of the PCR primer sequence (sense and antisense), respectively. The final six bases for both the PE 1.0 and PE 2.0 are identical. If any of these regions contained more than three consecutive base matches with a given barcode in any possible alignment, that barcode was discarded. All barcode sequences were compared to all other barcode sequences (including cases of offset sequences), and a barcode was discarded if there were more than 15 total base matches or 10 consecutive base matches to any other barcode sequence (sense and antisense) in any possible alignment. The total number of hydrogen bonds in the longest consecutive matching region of each barcode to any other barcode sequence was calculated, and barcodes with greater than 26 total hydrogen bonds in that region were discarded. Each barcode was aligned with the entire E. coli genome (sense and antisense). If at any position in the genome a barcode contained more than 16 total matching bases, 12 consecutive matching bases, or 32 hydrogen bonds present in any given consecutively matching region, that barcode was discarded. Finally, all possible indels were generated for each barcode and the resulting sequence was compared to each original barcode sequence. If the resulting indel sequence could incur fewer than five point mutations and result in the exact sequence of any barcode, the barcode sequence that generated the given indel was deleted.
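The first two filters above (GC content and homopolymer length, then total and consecutive matches against a primer in every ungapped alignment) can be sketched as follows. This is an illustration under assumptions: "antisense" is taken to mean the reverse complement, a "match" is taken to mean an identical base at an aligned position, and all function names are hypothetical. The hydrogen-bond, genome, and indel checks are not shown.

```python
from itertools import groupby


def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq):
    return max(len(list(run)) for _, run in groupby(seq))


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def match_stats(a, b):
    """Over all ungapped offsets of `a` against `b`, return the maximum number
    of identical positions and the maximum run of consecutive identical
    positions."""
    best_total = best_run = 0
    for offset in range(-(len(a) - 1), len(b)):
        total = run = 0
        for i, base in enumerate(a):
            j = i + offset
            if 0 <= j < len(b) and base == b[j]:
                total += 1
                run += 1
                best_run = max(best_run, run)
            else:
                run = 0
        best_total = max(best_total, total)
    return best_total, best_run


def passes_basic_filter(barcode):
    """GC content within 40-60% and no homopolymer longer than four bases."""
    return (0.40 <= gc_fraction(barcode) <= 0.60
            and longest_homopolymer(barcode) <= 4)


def passes_primer_filter(barcode, primer, max_total=10, max_run=5):
    """Reject if any alignment to the primer (sense or antisense) exceeds the
    total-match or consecutive-match threshold."""
    for target in (primer, reverse_complement(primer)):
        total, run = match_stats(barcode, target)
        if total > max_total or run > max_run:
            return False
    return True
```

The same `match_stats` helper could be reused with the other thresholds quoted above (for example 15 total or 10 consecutive matches for barcode-barcode comparisons) by changing `max_total` and `max_run`.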

Score Filtering.

In addition to the thresholds described above for characterizing an optimized barcode set, a more in-depth analysis of barcode-barcode and barcode-E. coli genome hybridization was performed, particularly regarding barcode hybridization melting temperatures with respect to PCR amplification. When a given barcode sequence is compared to a large set of reference sequences through sequence alignment, there is an alignment condition that results in a region in the barcode sequence where the absolute maximum number of consecutive base matches is achieved (as described above). There are also other alignment conditions in which the region of maximum consecutive base matches does not reach that absolute maximum value. The number of consecutive base matches for any given region is referred to as the score. When comparing a barcode to the sense and antisense sequences of all other barcodes, all alignment conditions were determined in the cases where the score of the region that contained the highest number of consecutive base matches (first score) was 10, nine, eight, and seven bases, respectively. For each of these four first scores, the condition where the maximum score of the region where the second-highest number of consecutive base matches (second score) occurred was determined. For example, a condition where the first score is 10 and the second score is three is denoted as 10-3 and this condition is defined as the duplex. If the sum of the first score and the second score compared to all other barcode sequences for any barcode was greater than 12, that barcode was discarded. If the distance between these two regions (maximum and second-most consecutive matches) was one base, and the sum of the first score and the second score was greater than 11, the barcode was discarded.
The maximum value of the number of consecutive base matches for a region under all alignment conditions that contained the third-highest number of consecutive base matches (third score) was also determined for all barcodes. Given the maximum third score, the respective maximum first score was determined; the maximum second score given both of these conditions was also determined. This condition is defined as the triplex and denoted, for example, as 7-4-3. Barcodes with the following triplexes were manually deleted: 7-3-3, 6-5-4, 6-5-3, 6-4-4, 5-5-4, and 5-4-4. Barcodes with a triplex of 6-4-3 were deleted where both distances between adjacent regions corresponding to the scores were one base.

The same analysis was done for all barcodes aligned against the entire E. coli genome (sense and antisense). Barcodes with a first score and second score sum of greater than 15 were discarded. Barcodes where the first score region and the second score region were separated by one nucleotide and had a first score and second score sum of more than 14 were discarded as well. Barcodes with the following triplexes were deleted: 8-4-4, 7-5-4, 6-6-4, 6-5-5, and 5-5-5. After filtering, a total of 150 barcodes remained.
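The duplex rules can be sketched as follows. This is a simplified reading of the score filtering, offered as an assumption rather than the original implementation: for each ungapped alignment the two longest runs of consecutive identical bases are taken as the first and second scores, and the sum and one-base-gap rules are applied. The triplex rules and antisense comparison are omitted, and the function names are hypothetical. Defaults correspond to the barcode-barcode thresholds; the E. coli thresholds would be `max_sum=15, max_sum_adjacent=14`.

```python
def match_runs(a, b, offset):
    """Return (start, length) for each run of consecutive identical bases
    when a[i] is aligned against b[i + offset]."""
    runs, start = [], None
    for i in range(len(a) + 1):
        j = i + offset
        hit = i < len(a) and 0 <= j < len(b) and a[i] == b[j]
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            runs.append((start, i - start))
            start = None
    return runs


def duplex_fails(a, b, max_sum=12, max_sum_adjacent=11):
    """Apply the two duplex rules to every ungapped alignment of a against b."""
    for offset in range(-(len(a) - 1), len(b)):
        runs = sorted(match_runs(a, b, offset), key=lambda r: r[1], reverse=True)
        if len(runs) < 2:
            continue
        (s1, n1), (s2, n2) = runs[0], runs[1]
        # Number of bases separating the two runs.
        gap = max(s1, s2) - min(s1 + n1, s2 + n2)
        if n1 + n2 > max_sum or (gap == 1 and n1 + n2 > max_sum_adjacent):
            return True
    return False
```

A barcode would be discarded if `duplex_fails` returns true against any other barcode in the set.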

Example VIII In-Depth Design and Preparation of Adapter

Adapter Design.

To avoid sequencing errors resulting from cluster overlap (i.e. low sequence complexity) and to reduce potential ligation bias, an additional two to five base extension (CT, ACT, GACT, or TGACT) was added to the 3′-end of each barcode. These sequences mimic the T-overhang in the conventional Illumina paired-end adapter and conserve the sequence of the last two bases. For each of the 150 final barcodes, these four different adapter extensions were attached to the 3′ end of the barcode. The same values as used in the initial filtering step (see above) were obtained for each of the four adapter candidates for all 150 barcodes. The following four parameters of analysis were determined: PCR primer matching (PC), 3′ end of PCR primer matching (TP), barcode-barcode matching (BB), and barcode-E. coli genome matching (EC). The complementarity score of each category was calculated for all barcode-adapter candidates as follows:

PC = {Sum of [maximum total base matches to the PE 1.0 and PE 2.0 PCR primers (sense and antisense for a total of four terms)]} + 2·{Sum of [maximum consecutive base matches to the PE 1.0 and PE 2.0 PCR primers (sense and antisense for a total of four terms)]²};

TP = {Sum of [maximum total base matches to the final four bases of the PCR primer sequence (sense and antisense for a total of two terms)]²} + 1.5·{Sum of [maximum total base matches to the final five bases of the PCR primer sequence (sense and antisense for a total of two terms)]²} + 2·{Sum of [maximum total base matches to the final six bases of the PCR primer sequence (sense and antisense for a total of two terms)]²};

BB = {Sum of [maximum total base matches to all other barcode candidates (sense and antisense for a total of two terms)]²} + 2·{Sum of [maximum consecutive base matches to all other barcode candidates (sense and antisense for a total of two terms)]²};

EC = Maximum total base matches to entire E. coli genome (sense only) + 2·[maximum consecutive base matches to the entire E. coli genome (sense only)]².

The total complementarity score (TC) for each barcode candidate was calculated as follows: TC=3·PC+15·TP+BB+EC. The TC value gives a metric to determine the expected efficacy of each barcode candidate during PCR amplification. A low TC represents a lower chance of amplification errors caused by unwanted hybridization between barcodes and adapters, primers, or the sample. For each barcode, the barcode-adapter candidate was selected that had either the lowest or second lowest TC among the four. This resulted in 150 final barcode-adapter sequences, of which 145 were randomly chosen and used. Of these, 37 carried the CT extension and 36 carried each of the other three extensions. Adapters were then designed in the same Y-shaped construct as the conventional Illumina paired-end adapter with a 22 to 25 base-pair extension that contained the barcode and a T-overhang (FIG. 10B). Both strands (A and B) of the adapter were ordered from Integrated DNA Technologies (IDT).
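The weighted sum and the ranking of the four extension candidates for one barcode can be sketched as follows. The weights come from the formula above; the category scores in `candidates` are hypothetical numbers chosen only to make the example concrete, and the function name is an assumption.

```python
def total_complementarity(pc, tp, bb, ec):
    """TC = 3*PC + 15*TP + BB + EC, as defined in the text."""
    return 3 * pc + 15 * tp + bb + ec


# Hypothetical category scores for the four extensions of a single barcode.
candidates = {
    "CT":    {"pc": 120, "tp": 30, "bb": 400, "ec": 350},
    "ACT":   {"pc": 110, "tp": 34, "bb": 380, "ec": 360},
    "GACT":  {"pc": 130, "tp": 28, "bb": 420, "ec": 340},
    "TGACT": {"pc": 125, "tp": 32, "bb": 410, "ec": 355},
}

scores = {ext: total_complementarity(**s) for ext, s in candidates.items()}
ranked = sorted(scores, key=scores.get)  # lowest TC first
print(ranked[:2])  # the two extensions eligible for selection
```

Allowing either the lowest or the second lowest TC leaves room to balance the number of barcodes assigned to each extension, consistent with the near-equal split reported above.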

Adapter Generation.

The 5′-end of strand B was phosphorylated in T4 DNA Ligase Reaction Buffer (New England Biolabs (NEB)) containing 40 μM strand B and 20 U T4 polynucleotide kinase (NEB) at 37° C. for 60 min in 20 μL, followed by a 25 minute incubation at 70° C., and a 5 minute incubation at 90° C. for enzyme inactivation. The phosphorylated strand B was annealed to each respective strand A in NEB Buffer 2 (NEB). Each solution contained 20 μM strand A and 20 μM strand B in a total volume of 20 μL. The solutions were first raised to 90° C. and cooled to 25° C. at a rate of 5° C./minute (Annealing Temperature Gradient). Finally, equal volumes of all 145 annealed adapters were mixed.

Example IX Design and Preparation of Spike-in and Normalization DNA

15,000 random 30 base-pair sequences were generated such that even if a sequence accumulated 15 mutations, it would still be identifiable and distinguishable from all other generated sequences. Spike-in and normalization candidates with a maximum homopolymer length of greater than 3, or a GC-content less than 11 or greater than 19 were discarded. Spike-in and normalization candidates were also discarded if they exceeded a certain degree of complementarity or sequence identity (total matches and maximum consecutive matches) with (1) the Illumina paired-end sequencing primers, (2) the 3′-end of the sequencing primers, (3) the whole E. coli genome [K-12 MG1655 strain (U00096.2)], and (4) all other generated spike-in candidates in the same fashion as barcode design. The final population consisted of 40 spike-in and normalization DNA candidates, of which three were chosen at random seven times (without replacement) and concatenated, with one deletion at the 60th base of strand A (corresponding to the 31st base of strand B) and an addition of a single A to the end of the sequence to form seven 90-base spike-in DNA sequences and one normalization DNA sequence. Both strands of 5′-end phosphorylated DNA oligos were ordered from IDT and were annealed in 0.3× NEB Buffer 2 (NEB) with 50 μM of each strand using the Annealing Temperature Gradient. All seven spike-in sequences were ligated to the barcoded adapter mixture in NEBNext Quick Ligation Reaction Buffer (NEB) with 6.7 μM annealed spike-in, 6.7 μM barcoded adapter, and 6 μM Quick Ligase (NEB) by incubating at 25° C. for 30 min. The product was run on a 5% polyacrylamide gel (Bio-Rad), and the targeted band (at ˜270 bp) was removed from the gel. The gel slice was cut into small pieces and the embedded DNA was extracted into diffusion buffer (10 mM Tris-Cl pH 8.0, 50 mM NaCl, 0.1 mM EDTA) by overnight incubation at room temperature. Then, the extracted spike-ins were purified on a column (Zymo Research).
Sequence analysis (GeneWiz) confirmed that the band contained the expected ligation product. The concentration of each spike-in was estimated by qPCR (Fast SYBR Master Mix, Applied Biosystems) using sequence-specific qPCR primers against a known-concentration Y-shaped Reference DNA (below). The concentrations of spike-ins for the second deep sequencing run were measured in parallel by digital PCR (Fluidigm) at Molecular Genetics Core Facility of Children's Hospital Boston Intellectual and Developmental Disabilities Research Center. Each spike-in was measured a total of ten times on two separate chips (48.770).

Example IX Design and Preparation of Y-Shaped Reference DNA

From the original list of 150 barcode candidates, we chose two barcodes that were not present in the final list of 145 used. Then we concatenated the two barcodes with the Y-shaped adapter sequences and a 90-base pair targeted sequence mimic such that the targeted sequence mimic was between the barcodes, which were between the adapters. The 90-base pair targeted sequence mimic was designed the same way as the spike-in and normalization DNAs. Both strands of DNA oligos were ordered from IDT and their concentrations were measured by absorbance at 260 nm using the extinction coefficient provided by IDT. The DNA oligos were annealed in water with 5 μM of each strand using the Annealing Temperature Gradient.

Example X E. coli RNA Preparation and cDNA Generation

E. coli [K-12 MG1655 strain (U00096.2)] was grown overnight at 30° C. in LB medium. The resulting culture was diluted 500-fold in fresh LB medium and grown at 30° C. for 3.5 hours such that O.D. at 600 nm became 0.30-0.35. 1 mL of cells was quickly killed by addition of 0.1 mL stop solution (90% (v/v) Ethanol and 10% (v/v) Phenol). The cells were collected by centrifugation (9,100×g, 1.5 min, room temperature), suspended in 1 mL cooled PBS (Lonza), and centrifuged again (16,000×g, 1.5 min, room temperature). The supernatant was removed and the cells were suspended in 0.1 mL of 1 mg/mL lysozyme in TE Buffer (pH 8.0) (Ambion). 0.1 mL of lysis buffer (Genosys) was added and the mixture was vortexed for 5 s. After adding 0.2 mL of Phenol Chloroform pH 4.5 (Sigma) and vortexing three times for 5 s, the mixture was centrifuged (16,000×g, 3 min, room temperature). The top layer of solution was taken and 0.15 mL of 100% 2-Propanol (Sigma) was added; the mixture was left on ice for 30 min. The solution was centrifuged (16,000×g, 30 min, 4° C.) to precipitate the RNA. The RNA pellet was washed twice by centrifugation (16,000×g, 5 min, 4° C.) with 0.75 mL of cold 70% (v/v) ethanol. After the second centrifugation, the supernatant was removed and the pellet was dried for 15 min at room temperature. Then, 88 μL of water was added and the mixture was incubated for 15 min at room temperature, followed by resuspension. The resulting solution was mixed with 0.04 U/μL DNase I (NEB) in DNase I Reaction Buffer (NEB) for a total volume of 100 μL, and the mixture was incubated at 37° C. for 30 min followed by addition of EDTA (Sigma) to a final concentration of 5 mM. The mixture was incubated at 75° C. for 10 min to inactivate DNase I, followed by column purification. Ribosomal RNA was removed using Ribo-Zero rRNA Removal Kit (Gram-Negative Bacteria) (Epicentre, Illumina). From this point, the conventional Illumina protocol for mRNA Sequencing Sample Preparation was followed with a few modifications.
The purified RNA was fragmented in 0.5× fragmentation buffer (Ambion) with ˜500 ng RNA in a 100 μL reaction solution. The solution was incubated on ice for 1 min after the fragmentation buffer was added followed by a 6 min incubation at 70° C. The tube was placed on ice and incubated for 1 min followed by addition of 4 μL stop solution (Ambion). The fragmented RNA was purified with a column and eluted in 11.1 μL of water. 1 μL of 50 μM Random Hexamer Primer (Applied Biosystems) was added to this solution, which was incubated at 65° C. for 5 minutes and then placed on ice. 4 μL 5× First Strand Buffer (Invitrogen), 2 μL 100 mM DTT (Invitrogen), 0.4 μL 25 mM dNTP Mix (Applied Biosystems), and 0.5 μL RNase inhibitor (Applied Biosystems) were added to the mixture. This was incubated at 25° C. for 2 minutes, followed by the addition of 1 μL Superscript II (Invitrogen). It was then incubated at 25° C. for 10 minutes, 42° C. for 50 minutes, and 70° C. for 15 minutes to synthesize the first strand of the cDNA and inactivate the enzyme. The sample was then placed on ice and purified on a column. The eluate from this column was used to generate the second strand of cDNA in NEBNext Second Strand Synthesis Reaction Buffer (NEB) with 0.3 U/μL DNA polymerase I (E. coli) (NEB), 1.25 U/μL E. coli DNA Ligase, and 0.25 U/μL RNase H in an 800 μL total volume solution at 16° C. for 2.5 hours followed by column purification. The eluted double stranded cDNA was end-repaired in T4 DNA Ligase Buffer (NEB) with 0.4 mM Deoxynucleotide Solution Mix (NEB), 0.5 U/μL T4 DNA polymerase (NEB), 0.5 U/μL T4 Polynucleotide Kinase (NEB) in a 200 μL reaction solution by incubating at 20° C. for 30 min followed by column purification. The eluted end-repaired cDNA was dA-tailed in NEB 2 buffer (NEB) with NEBNext dA-tailing Reaction Buffer with 1 mM dATP (NEB) and 0.3 U/μL Klenow Fragment (3′→5′ exo-) in a 50 μL solution by incubating at 37° C. for 30 min followed by column purification.

Example XI Sample-Adapter Ligation, Sequencing Sample Preparation, and Sequencing

The cDNA library was ligated to the barcoded adapter mixture and conventional Illumina paired-end adapter (without phosphorothioate bond) (IDT) respectively in the NEBNext Quick Ligation Reaction Buffer (NEB) with 5.4 μL of the cDNA produced above, 1.9 μM barcoded adapter or conventional Illumina Paired-end adapter, and 3.6 μM Quick Ligase (NEB) for a total volume of 10 μL by incubating at 25° C. for 15 min. The two solutions were separately run on a 5% polyacrylamide gel and the portion between 250-300 bp was cut. The gel slice was cut into small pieces and the embedded DNA was extracted into diffusion buffer (10 mM Tris-Cl pH 8.0, 50 mM NaCl, 0.1 mM EDTA) by overnight incubation at room temperature. Then, the extracted DNAs were column purified. The concentrations of purified products were measured by qPCR (Fast SYBR Master Mix, Applied Biosystems) against a known-concentration Y-shaped reference sequence using designed qPCR primers purchased from IDT. The samples ligated to the barcoded adapter and the conventional Illumina Paired-end adapter were amplified by PCR (1 cycle of 98° C. for 1 min, 18 cycles of 98° C. for 1 s, 65° C. for 45 s, 72° C. for 40 s, and 1 cycle of 72° C. for 5 min) in HF buffer (NEB) with 0.63 mM dNTP, 0.5 mM of each amplification primer modified from Illumina PCR primer PE 1.0 and 2.0 (IDT), 25 fM DNA sample, and Phusion DNA polymerase (NEB) in 20 μL, with spike-in DNAs (0.71 aM Spike-in 1, 1.0 aM Spike-in 2, 4.4 aM Spike-in 3, 18 aM Spike-in 4, 150 aM Spike-in 5, 480 aM Spike-in 6 for the first sequencing run, and 1.3 aM Spike-in 1, 6.9 aM Spike-in 3, 36 aM Spike-in 4, 150 aM Spike-in 5, 770 aM Spike-in 7 for the second sequencing run). Then, 10 pM Normalization DNA was added to both PCR products, and the DNA was purified twice on a column.
The concentration of the purified product was measured by qPCR (Fast SYBR Master Mix, Applied Biosystems) using the conventional Illumina qPCR primer (IDT) against a PCR product amplified from Y-shaped reference DNA using modified Illumina PCR primers whose concentration was measured by NanoDrop (LMS). The final concentration of each spike-in and normalization DNA in the purified products was measured by qPCR using sequence-specific qPCR primers described above and compared by normalization. The length distribution of the purified PCR product was measured by Bioanalyzer (Agilent). Samples with barcoded adapters were sequenced on an Illumina HiSeq 2000 with 2×100 (for the first sequencing run) and 2×50 (for the second) base paired-end reads in one lane.

Example XII Quantification Accuracy Using Digital PCR and Digital RNA-Seq of Spike-Ins

Spike-in Analysis.

From the raw sequencing data, reads were isolated which contained barcode sequences that corresponded to the original 145 barcodes, with at most one mismatch, in both forward and reverse reads for each sequencing cluster. The first 28 bases (26 bases for the second sequencing run) of the targeted sequence of both the forward and reverse reads of each cluster were aligned to each Spike-in sequence, which is known. Sequences with more than two mismatches were discarded. The number of unique tags present in each spike-in was counted to determine the number of copies of each spike-in.
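The spike-in counting steps can be sketched as follows. This is a simplified illustration, not the analysis pipeline used here: each cluster is represented by a hypothetical tuple of (forward barcode, reverse barcode, forward target prefix), only the forward target is compared to the spike-in, ungapped mismatch counting stands in for alignment, and the short toy sequences replace the real barcodes and 28-base prefixes.

```python
def hamming(a, b):
    """Mismatch count between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, known, max_mismatch=1):
    """Return the single known barcode within max_mismatch of `observed`,
    or None if there is no such barcode or the assignment is ambiguous."""
    hits = [k for k in known
            if len(k) == len(observed) and hamming(observed, k) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None


def count_spike_in_molecules(clusters, barcodes, spike_prefix,
                             max_target_mismatch=2):
    """Count unique (forward, reverse) barcode tags among clusters whose
    target prefix matches the spike-in with at most two mismatches."""
    tags = set()
    for fwd_bc, rev_bc, target in clusters:
        fwd = assign_barcode(fwd_bc, barcodes)
        rev = assign_barcode(rev_bc, barcodes)
        if fwd is None or rev is None:
            continue
        if len(target) != len(spike_prefix):
            continue
        if hamming(target, spike_prefix) > max_target_mismatch:
            continue
        tags.add((fwd, rev))
    return len(tags)


# Toy data (hypothetical): 4-base barcodes and an 8-base spike-in prefix.
barcodes = ["AAAA", "CCCC", "GGGG"]
spike_prefix = "ACGTACGT"
clusters = [
    ("AAAA", "CCCC", "ACGTACGT"),
    ("AAAT", "CCCC", "ACGTACGA"),  # same tag after one-mismatch correction
    ("GGGG", "AAAA", "ACGTACGT"),
    ("GGTT", "AAAA", "ACGTACGT"),  # barcode too far from any known barcode
    ("CCCC", "GGGG", "TTTTACGT"),  # target has three mismatches
]
n_molecules = count_spike_in_molecules(clusters, barcodes, spike_prefix)
```

Correcting a single mismatch is safe only because the optimized barcodes are far apart; with random barcodes the same correction could merge distinct molecules.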

To calibrate the digital RNA-Seq system, the concentrations of five synthetic DNA spike-in sequences were measured using the Fluidigm digital PCR platform and used as internal standards. The spike-in samples were barcoded, added to the barcoded E. coli cDNA library, and quantified using the sequencing-based digital counting strategy described above. FIG. 11A shows that the number of digital counts (i.e. unique barcodes) observed in deep sequencing is well-correlated with the digital PCR calibration of the spike-in sequences.

To evaluate the difference between using random barcode sequences and optimized barcode sequences, two experiments were conducted. In one experiment, the spike-in molecules were labeled with random barcode sequences, and in the second experiment, the optimized, pre-determined barcode set was used. The histograms of the number of reads for all barcodes observed from the most abundant spike-in sequence were constructed (FIG. 11B). When using random barcodes (light histogram in FIG. 11B), the left-most bin exhibits a large peak because a substantial fraction of barcodes are infrequently read due to sequencing errors. This causes barcodes to interconvert, generating quantification artifacts. In stark contrast, the left-most bin when using optimized barcodes (dark histogram in FIG. 11B) has no such peak because the optimized barcode sequences avoid misidentification due to sequencing errors. The effect of sequencing error on both random and optimized barcode counting is clearly shown by simulation (FIG. 12).

The dark histogram in FIG. 11B is the distribution of the number of reads for the 5,311 uniquely barcoded molecules from a particular spike-in. Assuming each barcoded spike-in molecule is identical, the dark histogram in FIG. 11B is essentially the probability distribution of the number of reads for a single molecule, which spans three orders-of-magnitude. This broad distribution arises primarily from intrinsic PCR amplification noise in sample preparation. Given this broad single molecule distribution, for low copy molecules in the original sample, counting the total number of reads (conventional RNA-Seq) would lead to inaccuracies. On the other hand, this problem can be circumvented if one counts the number of different barcodes (integrated area of the histogram) using the digital RNA-Seq approach, yielding accurate quantification with single copy resolution. The two counting schemes give the same results only when the copy number in the original sample is high, assuming there is no sequence-dependent bias. Random sampling of the barcode sequences by each target sequence is essential for accurate digital counting. FIG. 11C shows that the distribution of observed molecule counts is in excellent agreement with Poisson statistics. Therefore the five spike-in sequences sample the 21,025 barcode pairs without bias.

Example XIII Efficacy of Downsampling of Spike-in Reads

The spike-in reads in the replicate experiment were randomly downsampled by a factor of 10. For each read of each of the five spike-ins, there was a 10% chance that it was kept and counted and a 90% chance that it was discarded. FIG. 13A shows that there is little dropout between these two conditions (the data show the average dropout rate for the five spike-ins to be 1.6%) and that for the spike-in with the highest number of molecules, the change in the single molecule coverage histogram is minimal.
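The downsampling test can be sketched as follows, under the assumption that reads have already been grouped by unique tag. The names downsample and dropout_rate and the fixed random seed are illustrative choices, not part of the original analysis.

```python
import random

def downsample(read_counts, keep_prob=0.1, seed=0):
    """Keep each read independently with probability keep_prob.

    read_counts: dict mapping a unique tag to its number of reads.
    Tags that lose all of their reads are absent from the result.
    """
    rng = random.Random(seed)
    kept = {}
    for tag, n_reads in read_counts.items():
        k = sum(1 for _ in range(n_reads) if rng.random() < keep_prob)
        if k > 0:
            kept[tag] = k
    return kept

def dropout_rate(before, after):
    """Fraction of tags seen before downsampling that are missing afterwards."""
    if not before:
        return 0.0
    lost = sum(1 for tag in before if tag not in after)
    return lost / len(before)
```

A low dropout rate after discarding about 90% of the reads indicates that most molecules were covered by many reads, so the digital count is insensitive to sequencing depth in that regime.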

Example XIV Digital Quantification of the E. coli Transcriptome

E. coli Transcriptome Analysis. Reads were isolated which contained barcode sequences that corresponded to our original list of 145 barcodes, with at most one mismatch, in both the forward and reverse reads of each sequencing cluster. The first 28 bases (26 bases for the second sequencing run) of the targeted sequence of both the forward and reverse reads of each cluster were aligned to the E. coli genome, and the sequences that uniquely aligned with fewer than three mismatches, and for which the two reads did not map to the same (sense or antisense) strand of the genome, were kept. The remaining sequences were mapped to transcription units as described in Keseler, Nucleic Acids Res 39(Database issue):D583-D590 (2011) and sorted by starting and ending position as well as forward and reverse barcodes (unique tag). Mapped sequence fragments with a length of at least 1,000 bases were discarded. All sequences within the same transcription unit that had the same unique tag were analyzed further. Two or more sequences with the same unique tag were deemed identical if the distance between their center positions was less than four base-pairs and if the difference in length was less than 9 base-pairs (FIG. 14 and FIG. 15). The read counts for sequences deemed identical were then summed, and the sequence with more read counts was deemed the actual correct sequence. Then, for each unique sequence, the number of unique barcode tags that appeared was counted to determine the copy number of each sequence.
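The rule for collapsing sequences that share a unique tag can be sketched as follows. The thresholds (center distance below four base-pairs, length difference below 9 base-pairs) are taken from the text; the greedy order of comparison, the (start, end, reads) representation, and the name merge_same_tag are assumptions made for illustration.

```python
def merge_same_tag(fragments, max_center=4, max_length=9):
    """Collapse fragments that share one unique tag within one transcription unit.

    fragments: list of (start, end, reads) tuples.
    Fragments are visited in order of decreasing read count (an assumption),
    so the fragment with more reads is kept as the representative sequence
    and the read counts of fragments deemed identical are added to it.
    """
    merged = []  # each entry is [start, end, summed_reads]
    for start, end, reads in sorted(fragments, key=lambda f: -f[2]):
        center, length = (start + end) / 2, end - start
        for m in merged:
            m_center, m_length = (m[0] + m[1]) / 2, m[1] - m[0]
            if abs(center - m_center) < max_center and abs(length - m_length) < max_length:
                m[2] += reads
                break
        else:
            # No sufficiently similar fragment found: treat as a distinct sequence.
            merged.append([start, end, reads])
    return [tuple(m) for m in merged]
```

After this step, the copy number of a sequence would be obtained by counting the distinct unique tags under which it appears.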

Comparison of Noise in Conventional vs. Digital Counting for the E. coli Transcriptome. The total number of reads and the total number of digital counts for each base in each of the mapped sequences were summed. For each transcription unit, bins were created that were 99 base-pairs long and the total number of reads and the total number of digital counts present in each bin were summed. Bins that yielded an average number of digital counts per base of greater than or equal to 1 were selected and, from these bins, the average and sample standard deviation of the summed reads and of the summed digital counts were calculated. The noise was defined as the sample standard deviation divided by the mean and this value was calculated for both reads and digital counts; the ratio of the noise for reads to the noise for digital counts was computed for each transcription unit (FIG. 16D).
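The noise ratio for one transcription unit can be sketched as follows, assuming the per-bin sums of reads and of digital counts have already been computed. The name noise_ratio and the choice to return None when fewer than two bins qualify (or when the digital noise is zero) are illustrative assumptions.

```python
from statistics import mean, stdev

BIN_LENGTH = 99  # bin size in base-pairs, as stated in the text

def noise(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

def noise_ratio(read_sums, digital_sums, bin_length=BIN_LENGTH):
    """Ratio of read noise to digital-count noise for one transcription unit.

    read_sums, digital_sums: per-bin summed reads and summed digital counts.
    Only bins averaging at least one digital count per base are used.
    """
    kept = [(r, d) for r, d in zip(read_sums, digital_sums) if d / bin_length >= 1]
    if len(kept) < 2:
        return None  # a sample standard deviation needs at least two bins
    reads, digital = zip(*kept)
    digital_noise = noise(digital)
    if digital_noise == 0:
        return None  # ratio undefined when digital counts are perfectly uniform
    return noise(reads) / digital_noise
```

A ratio above one means that the read counts vary more from bin to bin than the digital counts do.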

We obtained 26-32 million reads from the barcoded cDNA libraries that uniquely mapped to the E. coli genome in two replicate experiments. FIG. 16A shows the number of conventional and digital counts (unique barcodes) as a function of nucleotide position for the fumA transcription unit (TU). Not surprisingly, the read density is considerably less uniform across the TU than the number of digital counts, presumably due to intrinsic noise and bias in fragment amplification.

It is advantageous for transcripts across the E. coli transcriptome to sample all barcodes evenly. FIG. 16B shows this distribution, which is close to Poisson but is somewhat overdispersed. Such biased sampling reduces the effective number of barcode sequences N_(eff) available. However, in the E. coli transcriptome sample, the copy number of the most abundant cDNA ranges from 10-40 copies for both counting methods. Based on Poisson statistics, even for the most abundant cDNA fragments in our sample, the required N_(eff) is ~100-400 for 95% unique labeling of all molecules. Because there are 21,025 barcode pairs available, on average the degree of randomness observed in FIG. 16B is sufficient.
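The N_(eff) estimate can be checked numerically. The sketch below rests on an assumption that is not stated in the text: that "95% unique labeling" refers to the expected number of distinct barcodes observed divided by the number of molecules, with each molecule drawing one of N equally likely barcodes. The function names are hypothetical.

```python
def expected_distinct_barcodes(copies, n_barcodes):
    """Expected number of distinct barcodes when `copies` molecules each
    draw one of `n_barcodes` equally likely barcodes at random."""
    return n_barcodes * (1.0 - (1.0 - 1.0 / n_barcodes) ** copies)

def counting_efficiency(copies, n_barcodes):
    """Expected digital count divided by the true copy number."""
    return expected_distinct_barcodes(copies, n_barcodes) / copies

# Copy numbers of 10 and 40 with roughly ten barcodes per copy.
for copies, n_barcodes in [(10, 100), (40, 400), (40, 21025)]:
    print(copies, n_barcodes, round(counting_efficiency(copies, n_barcodes), 4))
```

Under this reading, about ten barcodes per copy is enough to reach the 95% level, and the full set of 21,025 barcode pairs leaves a wide margin for the overdispersion seen in FIG. 16B.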

Conventional methods count the number of amplicons, a quantity that is subject to bias and intrinsic amplification noise, rather than the number of molecules in the original sample. Conversely, in the digital counting scheme described herein, unique barcode sequences distinguish each molecule in the sample, and so the effects of intrinsic noise are minimized, especially when an optimized barcode set is used. FIG. 16C shows how drastically different digital counting can be from conventional counting at low copy numbers, implying that digital counting of unique barcodes is advantageous, particularly for quantifying low copy fragments. The correlation is stronger for high copy fragments and the same phenomenon is also observed for whole TUs and genes (FIG. 17).

To demonstrate the superior accuracy of digital counting, the uniformity of abundance measurements within individual transcripts was examined. Because individual TUs were, by-and-large, intact RNA molecules following RNA synthesis, the cDNA fragments that map to one region of a given TU should have the same abundance as fragments that map to a different region of the same TU. The ratio between the variation in conventional counting v_(C) and the variation in digital counting v_(D) for TUs in different abundance ranges was histogrammed (FIG. 16D). A variation ratio of v_(C)/v_(D)=1 indicates that both conventional and digital counting give similarly uniform abundances along the length of a TU. For a TU where v_(C)/v_(D) exceeds one, conventional counting measures abundance less consistently along the TU than digital counting. The mean values of v_(C)/v_(D) in the two replicates are 1.4 (s=1.5, where s is sample standard deviation) and 1.2 (s=0.5) for the complete set of analyzed TUs, indicating that conventional counting is less consistent than digital counting across an average TU. Furthermore, the mean value of v_(C)/v_(D) increases with decreasing copy number and its distribution becomes broader (FIG. 16D). For TUs in the lowest abundance regime, the mean values of v_(C)/v_(D) are 1.9 (s=2.4) and 1.3 (s=0.9) for the two replicates. On average, digital counting outperforms conventional counting in terms of accuracy, and its performance advantage is most pronounced for low abundance TUs.

While FIG. 16 demonstrates that digital counting is less noisy and more accurate than conventional counting, FIG. 18 shows that digital counting is also more reproducible. This is demonstrated on the level of a single TU in FIG. 18A, which shows the ratio of counts between the two replicates for both conventional and digital counting along the fumA transcript. This ratio is consistently close to one for digital counting, but fluctuates over three orders-of-magnitude for conventional counting. We analyzed the global reproducibility of the whole transcriptome for quantification of TUs and genes for both conventional and digital counting in FIG. 18B and FIG. 18C, respectively. In both cases, the correlation between replicates is noticeably better for digital counting than conventional counting, particularly for low copy transcripts.

FIG. 19 is a simulation demonstrating the advantageous performance of digital counting methods described herein over conventional counting in differential expression analysis. RNA expression quantification was simulated using experimentally measured copy numbers, barcode sampling, and amplification noise distributions for two different libraries for each of three different systems (E. coli transcriptome fragments in FIG. 19A, E. coli transcription units in FIG. 19B, and human stomach microRNA in FIG. 19C). The ratio of simulated to actual fold-change for each gene as a function of the lower of two copy numbers for the two compared libraries is plotted. Ideally, the value of this ratio is one for all genes. Because digital counting is almost completely immune to amplification noise, it gives consistently superior performance to conventional counting for differential expression, even at low copy numbers. The discrepancy between conventional and digital counting is smaller for the E. coli transcription unit library in (B) than for the fragment library in (A) because amplification noise can be averaged over many fragments in the case of long transcription units.

Example XV Digital Quantification of Copy Numbers of Different Sequences at Different Concentrations

According to an aspect of the present disclosure, methods are provided for counting copy numbers of different sequences that are present in a sample at different concentrations using unique barcodes. Unlike conventional methods for counting molecules with very different copy numbers, the methods described herein using unique barcodes allow the entire sample to be processed in a single tube.

As a precursor, consider the conventional method which uses a dilution scheme requiring many tubes. In this implementation of molecular counting, one is able to combine sample dilutions by attaching the same barcode to all molecules in a given tube, for all tubes, to achieve digital counting. Consider Path A in FIG. 20, in which a sample contains two species (DNA1 and DNA2) which have copy numbers of 1111 and 11, respectively. After performing order-of-magnitude serial dilutions several times, the sample is now split into five different tubes, where each sequential tube contains 10-fold fewer copies of each DNA than the previous tube (Path A). The tubes are numbered from the left as tube 0, tube 1, etc. During the serial dilution process, at some point a given tube (tube 2) will no longer contain any copies of DNA2 but will still contain 10 copies of DNA1 due to a much higher number of DNA1 molecules in the original sample. The final tube (tube 4) has been diluted such that both DNA1 and DNA2 are no longer present. All molecules in a given tube are labeled with the same barcode sequence and the procedure is repeated for all tubes such that each different tube has a unique barcode sequence which maps to the specified tube. For example, all DNA molecules from tube 0 will be labeled by a specific barcode, whereas all DNA molecules from tube 1 will be labeled by a different barcode as illustrated.

After this process, all molecules from all tubes are combined together. Despite mixing together the DNA from all tubes, the barcoding scheme still allows tracking of which tube a given molecule is originally from. The combined sample will then be subjected to PCR amplification and then massively parallel sequencing; conventional PCR can easily be implemented by designing barcodes which contain the amplification primer sequences. After sequencing, four types of barcodes are detected for DNA1: barcode 1 (tube 0), barcode 2 (tube 1), barcode 3 (tube 2), and barcode 4 (tube 3). However, the barcode 5 representing dilution by 10⁴ (tube 4) does not appear. This suggests that the copy number of DNA1 originally present is at least (and is the same order-of-magnitude as) 10⁰+10¹+10²+10³=1,111, which is the same as the known original copy number of DNA1. Similarly, only two types of barcodes are detected for DNA2: barcode 1 (tube 0) and barcode 2 (tube 1), which suggests that the copy number of DNA2 originally present is at least (and is the same order-of-magnitude as) 10⁰+10¹=11.
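The arithmetic of the Path A estimate can be written out as a short sketch. The function name and the representation of each detected barcode by its dilution exponent (0 for barcode 1, 1 for barcode 2, and so on) are assumptions made for illustration.

```python
def lower_bound_copy_number(detected_exponents, dilution_factor=10):
    """Lower-bound estimate of the original copy number of one species.

    detected_exponents: dilution exponents k of the barcodes detected for
    that species, following the sum 10^0 + 10^1 + ... used in the text.
    """
    return sum(dilution_factor ** k for k in detected_exponents)

print(lower_bound_copy_number([0, 1, 2, 3]))  # DNA1: barcodes 1 to 4 detected
print(lower_bound_copy_number([0, 1]))        # DNA2: barcodes 1 and 2 detected
```

The result is an order-of-magnitude estimate only, since detection of a barcode establishes a threshold rather than an exact count.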

The same process may be applied to the same sample using only one tube and without any serial dilutions by utilizing stochastic labeling of DNA molecules by barcodes. Consider now the same original tube in FIG. 20 which contains 1111 DNA1 molecules and 11 DNA2 molecules. Instead of making serial dilutions, we will instead add a barcode set to the original tube (Path B), which contains five barcodes in the ratio 10,000:1,000:100:10:1. By adding the entire barcode set to the original sample tube, one would expect that labeling of all copies of both DNA molecules by said barcodes would also follow the same ratio. After barcode labeling and PCR amplification, the original tube is subjected to massively parallel sequencing, and the results are exactly the same as in Path A; there were at least 1111 DNA1 molecules and at least 11 DNA2 molecules. In effect, this labeling procedure is identical to the dilution protocol presented above. According to method B, digital counting is performed in a single tube by using a small number of barcodes (only one per order-of-magnitude) on a large number of unique DNA molecules with highly varying copy numbers. Accordingly a method is provided of determining the copy number of different nucleic acids in a sample by adding to the sample a set number of barcodes in a ratio to one another, allowing the barcodes to attach to the nucleic acids according to the ratio, and sequencing the nucleic acids to determine the copy number. Accordingly a method is provided of determining the copy number of different nucleic acids in a sample by attaching barcodes to the nucleic acids where the barcodes have an order of magnitude ratio to one another, and sequencing the nucleic acids to determine the copy number. The methods of attaching barcodes to nucleic acids in a sample, amplifying and sequencing as described herein can be used with the methods of this Example XV.
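The expectation that labeling follows the barcode ratio can be illustrated with a short calculation. This sketch assumes that each molecule is labeled by exactly one barcode, chosen with probability proportional to that barcode's abundance; the function name is hypothetical.

```python
def expected_labels(copy_number, ratios=(10000, 1000, 100, 10, 1)):
    """Expected number of molecules labeled by each barcode in the set,
    assuming labeling probability is proportional to barcode abundance."""
    total = sum(ratios)
    return [copy_number * r / total for r in ratios]

for name, copies in [("DNA1", 1111), ("DNA2", 11)]:
    print(name, [round(x, 2) for x in expected_labels(copies)])
```

For DNA1 the four most abundant barcodes are each expected on about one molecule or more, while for DNA2 only the two most abundant are, which mirrors the barcodes detected in Path A. The rarer barcodes have expectations well below one, which is the source of the probabilistic uncertainty in both paths.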

Both methods A and B have probabilistic uncertainty and these uncertainties are essentially the same. One can decrease this uncertainty dramatically and as a result accurately count all molecules on average by performing multiple replicate experiments. A consideration of this technique is the copy number of the highest copy barcode, which must be higher than the copy number of the most abundant DNA species in the original tube. Otherwise, there may not be enough resolution to determine the exact copy number of the higher copy templates. Although relatively low resolution is required for the example in FIG. 20, as the ratio of barcode copy numbers is separated by an order of magnitude each, it is very plausible to have many more barcodes each separated by a factor of two or less. Although this would increase the number of barcodes required, precision is also increased. To cover a 10⁶-fold range, one only needs 20 barcodes, assuming the case of two-fold separation. With this method, one can easily determine the copy number of multiple nucleic acid templates within a single tube without excessive or complicated sample preparation protocols.
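The figure of 20 barcodes for a 10⁶-fold range follows from 2²⁰ being approximately 1.05×10⁶. A minimal sketch of this count, in which the function name and the convention of counting fold-steps are assumptions:

```python
def barcodes_needed(fold_range, separation=2):
    """Smallest number of fold-steps n such that separation**n >= fold_range."""
    count, span = 0, 1
    while span < fold_range:
        span *= separation
        count += 1
    return count

print(barcodes_needed(10**6, separation=2))
```

A smaller separation factor raises this number but narrows the interval within which each copy number is resolved.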

What is claimed is:
 1. A method of determining copy number of a nucleic acid molecule in a sample including a plurality of nucleic acid molecules comprising attaching a unique barcode sequence to substantially each of the plurality of nucleic acid molecules in the sample to produce a plurality of barcoded nucleic acid molecules, amplifying the plurality of barcoded nucleic acid molecules in the sample to produce amplicons of the plurality of barcoded nucleic acid molecules, sequencing each amplicon to identify an associated nucleic acid sequence and an associated barcode sequence, selecting a first target nucleic acid sequence and determining the number of unique associated barcode sequences for the first target nucleic acid sequence, wherein the number of unique associated barcode sequences is the copy number of the first target nucleic acid sequence.
 2. The method of claim 1 wherein the nucleic acid molecules are DNA or RNA.
 3. The method of claim 1 wherein the plurality of barcoded nucleic acid molecules is a plurality of barcoded RNA molecules and further including the steps of reverse transcribing the plurality of barcoded RNA molecules to produce barcoded cDNA molecules and amplifying the barcoded cDNA molecules to produce amplicons of the barcoded cDNA molecules.
 4. The method of claim 3 comprising repeatedly reverse transcribing the plurality of barcoded RNA molecules to produce linear pre-amplified barcoded cDNA molecules and amplifying the linear pre-amplified barcoded cDNA molecules to produce amplicons of the linear pre-amplified barcoded cDNA molecules.
 5. The method of claim 4 wherein the step of repeatedly reverse transcribing the plurality of barcoded RNA molecules includes using reverse transcriptase and a nicking enzyme.
 6. The method of claim 1 wherein the plurality of barcoded nucleic acid molecules is a plurality of barcoded DNA molecules and further including the steps of repeated replication of the plurality of barcoded DNA molecules to produce a plurality of pre-amplified barcoded DNA molecules and amplifying the plurality of pre-amplified barcoded DNA molecules to produce amplicons of the plurality of pre-amplified barcoded DNA molecules.
 7. The method of claim 6 wherein the step of repeated replication of the plurality of barcoded DNA molecules includes using DNA polymerase and a nicking enzyme.
 8. The method of claim 1 wherein the sample is obtained from one or more cells of a first cell type and wherein amplification includes use of a primer generated from genomic DNA of the first cell type.
 9. A method of counting nucleic acid molecules in a sample including a plurality of nucleic acid molecules comprising attaching a unique barcode sequence to substantially each of the plurality of nucleic acid molecules in the sample to produce a plurality of barcoded nucleic acid molecules, amplifying the plurality of barcoded nucleic acid molecules in the sample to produce amplicons of the plurality of barcoded nucleic acid molecules, sequencing each amplicon to identify an associated barcode sequence, and counting the number of unique associated barcode sequences as a measure of the number of nucleic acid molecules in the sample.
 10. The method of claim 9 wherein the nucleic acid molecules are DNA or RNA.
 11. The method of claim 9 wherein the plurality of barcoded nucleic acid molecules is a plurality of barcoded RNA molecules and further including the steps of reverse transcribing the plurality of barcoded RNA molecules to produce barcoded cDNA molecules and amplifying the plurality of barcoded cDNA molecules to produce amplicons of the plurality of barcoded cDNA molecules.
 12. The method of claim 11 comprising repeatedly reverse transcribing the plurality of barcoded RNA molecules to produce linear pre-amplified barcoded cDNA molecules and amplifying the linear pre-amplified barcoded cDNA molecules to produce amplicons of the linear pre-amplified barcoded cDNA molecules.
 13. The method of claim 12 wherein the step of repeatedly reverse transcribing the plurality of barcoded RNA molecules includes using reverse transcriptase and a nicking enzyme.
 14. The method of claim 9 wherein the plurality of barcoded nucleic acid molecules is a plurality of barcoded DNA molecules and further including the steps of repeated replication of the plurality of barcoded DNA molecules to produce a plurality of pre-amplified barcoded DNA molecules and amplifying the plurality of pre-amplified barcoded DNA molecules to produce amplicons of the plurality of pre-amplified barcoded DNA molecules.
 15. The method of claim 14 wherein the step of repeated replication of the plurality of barcoded DNA molecules includes using DNA polymerase and a nicking enzyme.
 16. The method of claim 9 wherein the sample is obtained from one or more cells of a first cell type and wherein amplification includes use of a primer generated from genomic DNA of the first cell type.
 17. A method of determining copy numbers of nucleic acid molecules in a sample comprising attaching a unique barcode sequence to substantially each of the nucleic acid molecules in the sample to produce a plurality of barcoded nucleic acid molecules, amplifying the plurality of barcoded nucleic acid molecules in the sample to produce amplicons of the plurality of barcoded nucleic acid molecules, massively parallel sequencing the amplicons of the plurality of barcoded nucleic acid molecules to identify for each amplicon an associated nucleic acid sequence and an associated barcode sequence, and determining the number of unique associated barcode sequences for each nucleic acid sequence in the sample.
 18. The method of claim 17 wherein the nucleic acid molecules are DNA or RNA.
 19. The method of claim 17 wherein the plurality of barcoded nucleic acid molecules is a plurality of barcoded RNA molecules and further including the steps of reverse transcribing the plurality of barcoded RNA molecules to produce barcoded cDNA molecules and amplifying the plurality of barcoded cDNA molecules to produce amplicons of the plurality of barcoded cDNA molecules.
 20. The method of claim 19 comprising repeatedly reverse transcribing the plurality of barcoded RNA molecules to produce linear pre-amplified barcoded cDNA molecules and amplifying the linear pre-amplified barcoded cDNA molecules to produce amplicons of the linear pre-amplified barcoded cDNA molecules.
 21. The method of claim 20 wherein the step of repeatedly reverse transcribing the plurality of barcoded RNA molecules includes using reverse transcriptase and a nicking enzyme.
 22. The method of claim 17 wherein the plurality of barcoded nucleic acid molecules is a plurality of barcoded DNA molecules and further including the steps of repeated replication of the plurality of barcoded DNA molecules to produce a plurality of pre-amplified barcoded DNA molecules and amplifying the plurality of pre-amplified barcoded DNA molecules to produce amplicons of the plurality of pre-amplified barcoded DNA molecules.
 23. The method of claim 22 wherein the step of repeated replication of the plurality of barcoded DNA molecules includes using DNA polymerase and a nicking enzyme.
 24. The method of claim 17 wherein the sample is obtained from one or more cells of a first cell type and wherein amplification includes use of a primer generated from genomic DNA of the first cell type.